Polybase In Action
Kevin Feasel, Engineering Manager, Predictive Analytics, ChannelAdvisor
IT/Dev Connections, 2018-10-10


TRANSCRIPT

Page 1:

Polybase In Action


Kevin Feasel
Engineering Manager, Predictive Analytics

ChannelAdvisor

Page 2:

Who Am I? What Am I Doing Here?

Catallaxy Services

Curated SQL

We Speak Linux
@feaselkl

Page 3:

Polybase

Polybase is Microsoft's newest technology for connecting to remote servers.

It started by letting you connect to Hadoop and has expanded since then to include Azure Blob Storage. Polybase is also the best method to load data into Azure SQL Data Warehouse.

Page 4:

Polybase Targets

• SQL Server to Hadoop (Hortonworks or Cloudera, on-prem or IaaS)

• SQL Server to Azure Blob Storage
• Azure Blob Storage to Azure SQL Data Warehouse

In all three cases, you can use the T-SQL you know rather than a similar SQL-like language (e.g., HiveQL, SparkSQL, etc.) or some completely different language.

Page 5:

Polybase Targets – SQL Server 2019

• SQL Server to SQL Server
• SQL Server to Oracle
• SQL Server to MongoDB
• SQL Server to Teradata
• SQL Server to ODBC (e.g., Spark)

Page 6:

Massively Parallel Processing

Polybase extends the idea of Massively Parallel Processing (MPP) to SQL Server. SQL Server is a classic "scale-up" technology: if you want more power, add more RAM/CPUs/resources to the single server.

Hadoop is a great example of an MPP system: if you want more power, add more servers; the system will coordinate processing.

Page 7:

Why MPP?

It is cheaper to scale out than to scale up: 10 systems with 256 GB of RAM and 8 cores each are a lot cheaper than a single system with 2.5 TB of RAM and 80 cores.

At the limit, you eventually run out of room to scale up, but scale-out is much more practical: you can scale out to 2 petabytes of RAM, but good luck finding a single server that supports that much!

There is additional complexity involved, but MPP systems let you move beyond the power of a single server.

Page 8:

Polybase As MPP

MPP requires a head node and one or more compute nodes. Polybase lets you use SQL Server instances as the head and compute nodes. Scale-out servers must be on an Active Directory domain, and the head node must be Enterprise Edition, though the compute nodes can be Standard Edition.

Polybase lets SQL Server compute nodes talk directly to Hadoop data nodes, perform aggregations, and then return results to the head node. This removes the classic SQL Server single point of contention.
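As a rough sketch of how a scale-out group gets assembled (the stored procedure exists in SQL Server 2016 and later, but the machine name, port, and instance name below are placeholders to verify against your environment), each compute node is enlisted against the head node roughly like this:

-- Hypothetical values: run on each compute node to join the head node's scale-out group.
-- 16450 is the default Polybase data movement (DMS) control channel port.
EXEC sp_polybase_join_group N'HEADNODE01', 16450, N'MSSQLSERVER';
GO
-- To remove a node from a scale-out group later, sp_polybase_leave_group is the counterpart.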

Page 9:

Timeline

• Introduced in SQL Server Parallel Data Warehouse (PDW) edition, back in 2010

• Expanded in SQL Server Analytics Platform System (APS) in 2012.

• Released to the "general public" in SQL Server 2016, with most support being in Enterprise Edition.

• Extended support for additional technologies (like Oracle, MongoDB, etc.) will be available in SQL Server 2019.

Page 10:

Motivation

Today's talk will focus on using Polybase to integrate SQL Server 2016/2017 with Hadoop and Azure Blob Storage.

We will use a couple of smaller data sources to give you an idea of how Polybase works. Despite the small size of the demos, Polybase works best with a significant number of compute nodes, and Hadoop works best with a significant number of data nodes.

Page 11:

Installation Pre-Requisites

1. SQL Server 2016 or later, Enterprise or Developer Edition
2. Java Runtime Environment 7 Update 51 or later (get the latest version of 8 or 9; using JRE 9 requires SQL Server 2017 CU4)
3. Machines must be on a domain if you want to use scale-out
4. Polybase may only be installed once per server. If you have multiple instances, choose one. You can enable it on multiple VMs, however.

Page 12:

Installation

Select the “New SQL Server stand-alone installation” link in the SQL Server installer:

Page 13:

Installation

When you get to feature selection, check the “PolyBase Query Service for External Data” box:

Page 14:

Installation

If you get the following error, you didn’t install the Java Runtime Environment.

If you have JRE 9, you need SQL Server 2017 CU4 or later for SQL Server to recognize this.

Page 15:

Installation

For standalone installation, select the first radio button. This selection does not require your machine be connected to a domain.

Page 16:

Installation

The Polybase engine and data movement service accounts are NETWORK SERVICE accounts by default. There are no virtual accounts for Polybase.

Page 17:

Installation

After installation is complete, run the following against the SQL Server instance:

sp_configure @configname = 'hadoop connectivity', @configvalue = 7;
GO
RECONFIGURE
GO

Set the value to 6 for Cloudera’s Hadoop distribution, or 7 for Hortonworks or Azure Blob Storage.
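To double-check the setting afterward (a quick verification step, not part of the slide), read the value back with sp_configure; note that a change to 'hadoop connectivity' only takes effect after restarting SQL Server and the Polybase services.

EXEC sp_configure 'hadoop connectivity';  -- shows config_value and run_value
GO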

Page 18:

Hadoop Configuration

First, we need to make sure our Hadoop and SQL Server configuration settings are in sync.

We need to modify the yarn-site.xml and mapred-site.xml configuration files.

If you do not do this correctly, then MapReduce jobs will fail!

Page 19:

Hadoop Configuration

You will need to find your Hadoop configuration folder that came as part of the Polybase installation. By default, that is at:

C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\Binn\Polybase\Hadoop\conf

Inside this folder, there are two files we care about.

Page 20:

Hadoop Configuration

Next, go looking for your Hadoop installation directory. On HDP, you'll find it at:

/usr/hdp/[version]/hadoop/conf/

Note that the Polybase docs use /usr/hdp/current, but this is a bunch of symlinks with the wrong directory structure.

Page 21:

Hadoop Configuration

Modify yarn-site.xml and change the yarn.application.classpath property. For the Hortonworks distribution of Hadoop (HDP), you’ll see a series of values like:

<value>/usr/hdp/2.4.3.0-227/hadoop/*,/usr/hdp/2.4.3.0-227/hadoop/lib/*, …</value>

Replace 2.4.3.0-227 with your HDP version.

Page 22:

Hadoop Configuration

Include the following snippet in your mapred-site.xml file:

<property>
  <name>yarn.app.mapreduce.am.staging-dir</name>
  <value>/user</value>
</property>
<property>
  <name>mapreduce.jobhistory.done-dir</name>
  <value>/mr-history/done</value>
</property>
<property>
  <name>mapreduce.jobhistory.intermediate-done-dir</name>
  <value>/mr-history/tmp</value>
</property>

Without this configured, you will be unable to perform MapReduce operations on Hadoop.

Page 23:

Polybase Basics

In this section, we will look at three new constructs that Polybase introduces: external data sources, external file formats, and external tables.

Page 24:

External Data Source

External data sources allow you to point to another system. There are several external data sources, and we will look at two today.

Page 25:

External Data Source

CREATE EXTERNAL DATA SOURCE [HDP] WITH
(
    TYPE = HADOOP,
    LOCATION = N'hdfs://sandbox.hortonworks.com:8020',
    RESOURCE_MANAGER_LOCATION = N'sandbox.hortonworks.com:8050'
);

The LOCATION is the NameNode port and is needed for Hadoop filesystem operations. RESOURCE_MANAGER_LOCATION is the YARN port and is needed for predicate pushdown.

Page 26:

External File Format

External file formats explain the structure of a data set. There are several file formats available to us.

Page 27:

External File Format: Delimited File

Delimited files are the simplest to understand but tend to be the least efficient.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT
    [ , FORMAT_OPTIONS ( <format_options> [ ,...n ] ) ]
    [ , DATA_COMPRESSION = {
        'org.apache.hadoop.io.compress.GzipCodec' |
        'org.apache.hadoop.io.compress.DefaultCodec'
      } ]
);

Page 28:

External File Format: Delimited File

<format_options> ::=
{
    FIELD_TERMINATOR = field_terminator
    | STRING_DELIMITER = string_delimiter
    | DATE_FORMAT = datetime_format
    | USE_TYPE_DEFAULT = { TRUE | FALSE }
}
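As a concrete illustration of those options (the format name and choices below are arbitrary, not from the deck), a pipe-delimited, gzip-compressed format could be defined like this:

CREATE EXTERNAL FILE FORMAT [PipeDelimitedGzip] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N'|',
        USE_TYPE_DEFAULT = TRUE
    ),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);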

Page 29:

External File Format: RCFile

Record Columnar files are an early form of columnar storage.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = RCFILE,
    SERDE_METHOD = {
        'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe' |
        'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
    }
    [ , DATA_COMPRESSION = 'org.apache.hadoop.io.compress.DefaultCodec' ]
);

Page 30:

External File Format: ORC

Optimized Row Columnar files are strictly superior to RCFile.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = ORC
    [ , DATA_COMPRESSION = {
        'org.apache.hadoop.io.compress.SnappyCodec' |
        'org.apache.hadoop.io.compress.DefaultCodec'
      } ]
);

Page 31:

External File Format: Parquet

Parquet files are also columnar. Cloudera prefers Parquet, whereas Hortonworks prefers ORC.

CREATE EXTERNAL FILE FORMAT file_format_name
WITH
(
    FORMAT_TYPE = PARQUET
    [ , DATA_COMPRESSION = {
        'org.apache.hadoop.io.compress.SnappyCodec' |
        'org.apache.hadoop.io.compress.GzipCodec'
      } ]
);

Page 32:

External File Format: Comparison

Method    | Good           | Bad                                                   | Best Uses
Delimited | Easy to use    | Less efficient, slower performance                    | Easy Mode
RCFile    | Columnar       | Strictly superior options                             | Don't use this
ORC       | Great agg perf | Columnar not always a good fit; slower to write       | Non-nested files with aggregations of subsets of columns
Parquet   | Great agg perf | Columnar not always a good fit; often larger than ORC | Nested data

Page 33:

External Tables

External tables use external data sources and external file formats to point to some external resource and visualize it as a table.

Page 34:

External Tables

CREATE EXTERNAL TABLE [dbo].[SecondBasemen]
(
    [FirstName] [VARCHAR](50) NULL,
    [LastName] [VARCHAR](50) NULL,
    [Age] [INT] NULL,
    [Throws] [VARCHAR](5) NULL,
    [Bats] [VARCHAR](5) NULL
)
WITH
(
    DATA_SOURCE = [HDP],
    LOCATION = N'/tmp/ootp/secondbasemen.csv',
    FILE_FORMAT = [TextFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);

Page 35:

External Tables

External tables appear to end users just like normal tables: they have a two-part name (schema and table) and even show up in Management Studio, though in an External Tables folder.
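They are also visible through the catalog views; a quick inventory query (illustrative, using the standard sys.external_* views) looks like this:

SELECT t.name, t.location, ds.name AS data_source, ff.name AS file_format
FROM sys.external_tables t
    INNER JOIN sys.external_data_sources ds ON ds.data_source_id = t.data_source_id
    INNER JOIN sys.external_file_formats ff ON ff.file_format_id = t.file_format_id;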

Page 36:

Demo Time

Page 37:

Querying Hadoop

Once we have created an external table, we can write queries against it just like any other table.
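For instance, an aggregation over the SecondBasemen external table defined earlier is plain T-SQL (the query itself is just an illustration):

SELECT sb.Throws, sb.Bats, COUNT(1) AS NumberOfPlayers, AVG(sb.Age) AS AverageAge
FROM dbo.SecondBasemen sb
GROUP BY sb.Throws, sb.Bats
ORDER BY sb.Throws, sb.Bats;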

Page 38:

Demo Time

Page 39:

Querying Hadoop – MapReduce

In order to perform a MapReduce operation, we need the external data source to be set up with a resource manager. We also need one of the following:

1. The internal cost must be high enough (based on external table statistics) to run a MapReduce job.

2. We force a MapReduce job by using the OPTION (FORCE EXTERNALPUSHDOWN) query hint (shown below).

Note that there is no "cost threshold for MapReduce," so the non-forced decision is entirely under the Polybase engine's control.
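A forced-pushdown version of a query (illustrative, reusing the earlier external table) simply appends the hint:

SELECT sb.FirstName, sb.LastName, sb.Age
FROM dbo.SecondBasemen sb
WHERE sb.Age >= 30
OPTION (FORCE EXTERNALPUSHDOWN);
-- The opposite hint, OPTION (DISABLE EXTERNALPUSHDOWN), keeps processing on the SQL Server side.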

Page 40:

Querying Hadoop – MapReduce

Functionally, MapReduce queries operate the same as basic queries. Aside from the query hint, there is no special syntax for MapReduce operations and end users don't need to think about it.

WARNING: if you are playing along at home, your Hadoop sandbox should have at least 12 GB of RAM allocated to it. This is because Polybase creates several 1.5 GB containers on top of memory requirements for other Hadoop services.

Page 41:

Demo Time

Page 42:

Querying Hadoop – Statistics

Although external tables have none of their data stored on SQL Server, the database optimizer can still make smart decisions by using statistics.

Page 43:

Querying Hadoop – Statistics

Important notes regarding statistics:

1. Stats are not auto-created.
2. Stats are not auto-updated.
3. The only way to update stats is to drop and re-create them (see the sketch after this list).
4. SQL Server generates stats by bringing the data over, so you must have enough disk space! If you sample, you only need to bring that percentage of rows down.
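A minimal sketch of creating statistics on the earlier external table (the statistic name and column choice are arbitrary):

CREATE STATISTICS s_SecondBasemen_Age
    ON dbo.SecondBasemen (Age)
    WITH FULLSCAN;
-- There is no UPDATE STATISTICS path for external tables: drop and re-create instead.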

Page 44:

Querying Hadoop – Statistics

Statistics are stored in the same location as any other table's statistics, and the optimizer uses them the same way.

Page 45:

Demo Time

Page 46:

Querying Hadoop – Data Insertion

Not only can we select data from Hadoop, we can also write data to Hadoop.

We are limited to INSERT operations. We cannot update or delete data using Polybase.
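A sketch of what an insert looks like, assuming the dbo.SecondBasemen external table and the local Player.SecondBasemen table from the demos; note that writing to Hadoop-backed external tables also requires the 'allow polybase export' configuration option to be enabled (worth verifying on your build):

-- Assumed prerequisite for INSERT into external tables:
EXEC sp_configure 'allow polybase export', 1;
RECONFIGURE;
GO
INSERT INTO dbo.SecondBasemen (FirstName, LastName, Age, Throws, Bats)
SELECT sb.FirstName, sb.LastName, sb.Age, sb.Throws, sb.Bats
FROM Player.SecondBasemen sb;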

Page 47:

Demo Time

Page 48:

Azure Blob Storage

Hadoop is not the only data source we can integrate with using Polybase. We can also insert and read data in Azure Blob Storage.

The basic constructs of external data source, external file format, and external table are the same, though some of the options are different.

Page 49:

Azure Blob Storage

Create an external data source along with a database scoped credential (for secure access to this blob):

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '{password}';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '{access key}';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://[email protected]',
    CREDENTIAL = AzureStorageCredential
);

Page 50:

Azure Blob Storage

External file formats are the same between Hadoop and Azure Blob Storage.

CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    )
);

Page 51:

Azure Blob Storage

External tables are similar to Hadoop as well.

CREATE EXTERNAL TABLE [dbo].[Flights2008]
(...)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5000
);

Page 52:

Azure Blob Storage

Start with a set of files:

Page 53:

Azure Blob Storage

After creating the table, select top 10:

Page 54:

Azure Blob Storage

Running an expensive aggregation query:

SELECT
    fa.[year],
    COUNT(1) AS NumberOfRecords
FROM dbo.FlightsAll fa
GROUP BY
    fa.[year]
ORDER BY
    fa.[year];

Page 55:

Azure Blob Storage

While we're running the expensive aggregation query, we can see that the mpdwsvc app chews up CPU and memory:

Page 56:

Azure Blob Storage

Create a table for writing to blob storage:

CREATE EXTERNAL TABLE [dbo].[SecondBasemenWASB]
(...)
WITH
(
    DATA_SOURCE = [WASBFlights],
    LOCATION = N'ootp/',
    FILE_FORMAT = [CsvFileFormat],
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5
);

Page 57:

Azure Blob Storage

Insert into the table:

INSERT INTO dbo.SecondBasemenWASB
SELECT
    sb.FirstName,
    sb.LastName,
    sb.Age,
    sb.Bats,
    sb.Throws
FROM Player.SecondBasemen sb;

Page 58:

Azure Blob Storage

Eight files are created:

Page 59:

Azure Blob Storage

Multiple uploads create separate file sets:

Page 60:

Other Azure Offerings – Azure SQL DW

Polybase features prominently in Azure SQL Data Warehouse, as Polybase is the best method for getting data into an Azure SQL DW cluster.

Page 61:

Other Azure Offerings – Azure SQL DW

Access via SQL Server Data Tools, not SSMS:

Page 62:

Other Azure Offerings – Azure SQL DW

Once connected, we can see the database.

Page 63:

Other Azure Offerings – Azure SQL DW

Create an external data source to Blob Storage:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = '{password}';
GO
CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'cspolybase',
SECRET = '{access key}';
GO
CREATE EXTERNAL DATA SOURCE WASBFlights
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://[email protected]',
    CREDENTIAL = AzureStorageCredential
);

Page 64:

Other Azure Offerings – Azure SQL DW

Create an external file format:

CREATE EXTERNAL FILE FORMAT [CsvFileFormat] WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = N',',
        USE_TYPE_DEFAULT = TRUE
    )
);
GO

Page 65:

Other Azure Offerings – Azure SQL DW

Create an external table:

CREATE EXTERNAL TABLE [dbo].[Flights2008]
(...)
WITH
(
    LOCATION = N'historical/2008.csv.bz2',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5000
);

Page 66:

Other Azure Offerings – Azure SQL DW

Blob Storage data retrieval isn’t snappy:

Page 67:

Other Azure Offerings – Azure SQL DW

Use CTAS syntax to create an Azure SQL DW table:

CREATE TABLE [dbo].[Flights2008DW]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(tailnum)
)
AS SELECT * FROM dbo.Flights2008;
GO

Page 68:

Other Azure Offerings – Azure SQL DW

Azure SQL Data Warehouse data retrieval is snappy:

Page 69:

Other Azure Offerings – Azure SQL DW

We can export data to Azure Blob Storage:

CREATE EXTERNAL TABLE [dbo].[CMHFlights]
WITH
(
    LOCATION = N'columbus/',
    DATA_SOURCE = WASBFlights,
    FILE_FORMAT = CsvFileFormat,
    -- Up to 5000 rows can have bad values before Polybase returns an error.
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 5000
)
AS SELECT * FROM dbo.FlightsAllDW WHERE dest = 'CMH';
GO

Page 70:

Other Azure Offerings – Azure SQL DW

This CETAS syntax lets us write out the result set:

Page 71:

Other Azure Offerings – Azure SQL DW

CETAS created 60 files, 1 for each Azure DW compute node:

Page 72:

Other Azure Offerings – Data Lake

Polybase can only read from Azure Data Lake Storage if you are pulling data into Azure SQL Data Warehouse.

The general recommendation for SQL Server is to use U-SQL and Azure Data Lake Analytics to pull the data someplace where SQL Server can read it.

Page 73:

Other Azure Offerings – HDInsight

Polybase is not supported in Azure HDInsight. Polybase requires access to ports that are not available in an HDInsight cluster.

The general recommendation is to use Azure Blob Storage as an intermediary between SQL Server and HDInsight.

Page 74:

Other Azure Offerings – SQL DB

Polybase concepts like external tables drive Azure SQL Database's cross-database support.

Despite this, we cannot use Polybase to connect to Hadoop or Azure Blob Storage via Azure SQL Database.

Page 75:

Issues -- Docker

Polybase has significant issues connecting to Dockerized Hadoop nodes. For this reason, I do not recommend using the HDP 2.5 or 2.6 sandboxes, either in the Azure marketplace or on-prem.

Instead, I recommend building your own Hadoop VM or machine using Ambari.

Page 76:

Issues -- MapReduce

Current-gen Polybase supports direct file access and MapReduce jobs in Hadoop. It does not support connecting to Hive warehouses, using Tez, or using Spark.

Because Polybase's Hadoop connector does not support these, it must fall back on a relatively slow method for data access.

Page 77:

Issues -- File Formats

Polybase only supports files without in-text newlines. This makes it impractical for parsing long text columns which may include newlines.

Polybase is limited in its file format support and does not support the Avro file format, which is a superior rowstore data format.

Page 78:

Big Data Clusters – Next-Gen Polybase

The next generation of Polybase involves being able to connect to Oracle, Elasticsearch, MongoDB, Teradata, and anything with an ODBC interface (like Spark).

This is now available in the SQL Server 2019 public preview. The goal is to make SQL Server a “data hub” for interaction with various technologies and systems, with SQL Server as a virtualization layer.

Page 79:

For Further Thought…

Some interesting uses of Polybase:

• Hot/Cold partitioned views (see the sketch below)
• Hadoop-based data lake enriched by SQL data
• "Glacial" data in Azure Blob Storage
• Replacement for linked servers (with Polybase vNext)
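As a sketch of the hot/cold idea (the table and column names here are made up for illustration), a view can union recent rows in a regular table with older rows in an external table sitting on cheap storage:

-- Hypothetical hot/cold split: recent data stays local, history lives externally.
CREATE VIEW dbo.FlightsHistory AS
    SELECT f.flightdate, f.origin, f.dest, f.tailnum
    FROM dbo.FlightsHot AS f           -- regular SQL Server table (hot data)
    UNION ALL
    SELECT c.flightdate, c.origin, c.dest, c.tailnum
    FROM dbo.FlightsColdExternal AS c; -- external table over Blob Storage (cold data)
GO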

Page 80:

Wrapping Up

Polybase was one of the key SQL Server 2016 features. There is still room for growth (and a team hard at work), but it is a great integration point between SQL Server and Hadoop / Azure Blob Storage.

To learn more, go here: https://CSmore.info/on/polybase

And for help, contact me: [email protected] | @feaselkl