A Perfect Hybrid: Split query processing in Polybase
biaobiaoqi <[email protected]>, 2013/4/25


Page 1: A Perfect Hybrid

A Perfect Hybrid

Split query processing in Polybase

[email protected]

2013/4/25

Page 2: A Perfect Hybrid

Outline
• Background
• Related Work
• Overview of Polybase
• Architecture of PDW
• Polybase Implementation
• Performance Evaluation

Page 3: A Perfect Hybrid

Background
• Structured data & unstructured data
• RDBMS & Big Data

[Figure: RDBMS and Hadoop, combined to yield insight]

Page 4: A Perfect Hybrid

Related Work
• Sqoop: transfers bulk data between Hadoop and structured data stores such as relational databases
• Teradata & Aster Data
• Greenplum & Vertica: external tables
• Oracle: external tables and OLH (Oracle Loader for Hadoop)
• IBM: split mechanism that uses MapReduce to access the appliance
• Hadapt (HadoopDB): designed from the outset to support the execution of SQL-like queries across both unstructured and structured data sets

Page 5: A Perfect Hybrid

Overview of Polybase
• A new feature in PDW V2
o Uses standard SQL
o Deals with both structured and unstructured data (in SQL Server and Hadoop)
o Leverages the capabilities of SQL Server PDW, especially its cost-based parallel query optimizer and execution engine

Page 6: A Perfect Hybrid

PDW Architecture
• Parallel Data Warehouse
• Shared-nothing system

Page 7: A Perfect Hybrid

Components in PDW
• Node
o SQL Server instance on each node
o Data are hash-partitioned across the compute nodes
o Control node [the PDW Engine runs in it]:
• query parsing
• optimization
• creating the distributed execution plan (DSQL) for the compute nodes
• tracking the execution steps of the plan on the compute nodes
o Compute node:
• storage
• query processing
• DMS: Data Movement Service
o (1) repartitions rows of a table among the SQL Server instances on PDW compute nodes
o (2) converts fields of rows being loaded into the appliance into the appropriate ODBC types

Page 8: A Perfect Hybrid

Components in Polybase

• External table
o Makes data in Hadoop visible to SQL Server
o Content: namenode, namenode port, jobtracker, jobtracker port, input format class, column types of the table
• HDFS Bridge in DMS
o Transfers data in parallel between the nodes of the Hadoop and PDW clusters
• Cost-based query optimizer
o Wraps the one in PDW Version 1
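The metadata that an external table definition captures, as listed above, can be sketched as a small record. This is a hypothetical Python sketch; the field names follow the slide, not Polybase's actual catalog schema:

```python
# Sketch of what an external table definition records, per the slide:
# where the Hadoop cluster lives and how to interpret the file's records.
# All names here are illustrative, not Polybase's real catalog layout.
from dataclasses import dataclass, field

@dataclass
class ExternalTable:
    namenode: str
    namenode_port: int
    jobtracker: str
    jobtracker_port: int
    input_format_class: str          # e.g. a Hadoop InputFormat class name
    column_types: dict = field(default_factory=dict)

customer = ExternalTable("hdfs-nn", 8020, "hdfs-jt", 50030,
                         "org.apache.hadoop.mapred.TextInputFormat",
                         {"c_custkey": "bigint"})
```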

Page 9: A Perfect Hybrid

HDFS Bridge

Page 10: A Perfect Hybrid

HDFS Bridge
• HDFS shuffle phase (reads data from Hadoop):
o 1. Communicate with the namenode to get info about the file
o 2. Balance the number of bytes read by each DMS instance (based on HDFS info and the DMS instance count)
o 3. Invoke openRecordReader(); the RecordReader instance communicates directly with the datanodes
o 4. Get the data and convert it into ODBC types (may be done in a MapReduce job)
o 5. Apply a hash function to determine the target node for each record
• Writing to Hadoop is almost the same
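The balancing and partitioning steps above can be sketched as follows. This is a minimal Python sketch under assumed names, not the actual DMS/HDFS Bridge implementation; the greedy byte-balancing heuristic is my assumption:

```python
# Sketch of the HDFS shuffle phase described above: balance the bytes read
# across DMS instances (step 2), then hash-partition each record to a
# target PDW compute node (step 5). Hypothetical names throughout.

def assign_blocks(block_sizes, num_dms):
    """Greedily assign HDFS blocks so each DMS instance reads ~equal bytes."""
    loads = [0] * num_dms
    assignment = {i: [] for i in range(num_dms)}
    for block, size in sorted(enumerate(block_sizes), key=lambda b: -b[1]):
        target = loads.index(min(loads))   # least-loaded DMS instance so far
        assignment[target].append(block)
        loads[target] += size
    return assignment

def target_node(record_key, num_compute_nodes):
    """Step 5: a hash function decides which compute node gets a record."""
    return hash(record_key) % num_compute_nodes
```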

Page 11: A Perfect Hybrid

Query Optimizer
• Parsing
o A Memo data structure of alternative serial plans
• Parallel optimization [in PDW V1]
o A bottom-up optimizer inserts data movement operators into the serial plans
• Cost-based decision [whether to push work to Hadoop]
o Based on:
• statistics
• the relative sizes of the two clusters and other factors

Page 12: A Perfect Hybrid

Statistics
• Define statistics on an external table
o CREATE STATISTICS hdfsCustomerStats ON hdfsCustomer (c_custkey);
• Steps to obtain statistics on HDFS data
o 1. Read block-level sample data via DMS or map jobs
o 2. Partition the samples across the compute nodes
o 3. Each node calculates a histogram on its portion
o 4. Merge all histograms into one stored in the database catalog
• An alternative implementation
o On Hadoop V2, let the Hadoop cluster calculate the histograms (costs a lot)
o Makes the best use of the Hadoop cluster's computational resources
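The per-node histogram and merge steps above can be sketched as follows. This is an illustrative Python sketch; the equi-width bucketing is my assumption, and PDW's actual histogram format differs:

```python
# Sketch of the statistics steps above: each compute node builds a
# histogram over its sample partition (step 3), then the per-node
# histograms are merged into a single one for the catalog (step 4).
from collections import Counter

def node_histogram(values, bucket_width=10):
    """One node's histogram over its portion of the sample (equi-width)."""
    return Counter(v // bucket_width for v in values)

def merge_histograms(histograms):
    """Merge per-node histograms into the catalog's single histogram."""
    merged = Counter()
    for h in histograms:
        merged += h
    return merged
```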

Page 13: A Perfect Hybrid

Else: Semantic Compatibility
• Data types
o Java primitive types
o Non-primitive types
o Third-party types that can be implemented
o Mark those that cannot be implemented in Java [processed only in PDW]
• SQL semantics
o Return values of expressions: implemented in Java
o Returning null: e.g. A+B becomes (A==null || B==null) ? null : (A+B)
o Mark those that cannot be implemented in Java [processed only in PDW]
• Error handling
o Exceptions that would be raised in SQL should also be thrown in Java
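The null-propagation rule above is what the generated Java code must reproduce for SQL semantics; the same rule can be sketched in Python (using None for SQL NULL):

```python
# Sketch of the SQL null-propagation semantics described above: an
# arithmetic expression evaluates to NULL (None here) when any operand
# is NULL, mirroring the slide's (A==null || B==null) ? null : (A+B).
def sql_add(a, b):
    if a is None or b is None:
        return None
    return a + b
```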

Page 14: A Perfect Hybrid

Example
• SELECT count(*) from Customer WHERE acctbal < 0 GROUP BY nationkey

Page 15: A Perfect Hybrid

Optimized Query Plan #1

Page 16: A Perfect Hybrid

Optimized Query Plan #2

Page 17: A Perfect Hybrid

MapReduce Join
• Distributed hash join
o Supports equi-joins
• Implementation:
o Build side: the side with the smaller amount of data; materialized in HDFS
o Probe side: the other side
o Partition the build side so it fits in memory, to speed up the join
o The build side may also be replicated
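The build/probe scheme above can be sketched as a single-node hash join. This is a minimal Python sketch; the cross-node partitioning and replication the slide mentions are omitted:

```python
# Sketch of the distributed hash join described above: the smaller
# (build) side is loaded into an in-memory hash table, then the larger
# (probe) side streams past it, emitting rows with matching join keys.
def hash_join(build_rows, probe_rows, build_key, probe_key):
    table = {}
    for row in build_rows:                 # build phase: smaller input
        table.setdefault(row[build_key], []).append(row)
    for row in probe_rows:                 # probe phase: larger input
        for match in table.get(row[probe_key], []):
            yield {**match, **row}         # equi-join on matching keys
```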

Page 18: A Perfect Hybrid

Performance Evaluation

• Test configurations:
o C-16/48: 16-node PDW cluster, 48-node Hadoop cluster
o C-30/30: 30-node PDW cluster, 30-node Hadoop cluster
o C-60: 60-node PDW cluster and 60-node Hadoop cluster

• Test database:
o Two identical tables T1 and T2
• 10 billion rows
• 13 integer attributes and 3 string attributes (~200 bytes/row)
• About 2 TB uncompressed
o One copy of each table in HDFS
• HDFS block size of 256 MB
• Stored as a compressed RCFile
• RCFiles store rows "column-wise" inside a block
o One copy of each table in PDW
• Block-wise compression enabled

Page 19: A Perfect Hybrid

Selection on HDFS table

SELECT u1, u2, u3, str1, str2, str4 from T1 (in HDFS) where (u1 % 100) < sf

[Figure: execution time (secs.) vs. selectivity factor (%); stacked bars for PDW, Import, and MR time; series: Polybase Phase 1 (PB.1, split plan SP) vs. Polybase Phase 2]

Crossover point: above a selectivity factor of ~80%, PB Phase 2 is slower

Page 20: A Perfect Hybrid

Join HDFS Table with PDW Table

SELECT * from T1 (HDFS), T2 (PDW) where T1.u1 = T2.u2 and (T1.u2 % 100) < sf and (T2.u2 % 100) < 50

[Figure: execution time (secs.) vs. selectivity factor (%); stacked bars for PDW, Import, and MR time; series: Polybase Phase 1 (PB.1, split plan SP) vs. Polybase Phase 2]

Page 21: A Perfect Hybrid

Join Two HDFS Tables

SELECT * from T1 (HDFS), T2 (HDFS) where T1.u1 = T2.u2 and (T1.u2 % 100) < SF and (T2.u2 % 100) < 10

[Figure: execution time (secs.) vs. selectivity factor; stacked bars for PDW, Import-Join, MR-Shuffle-J, MR-Shuffle, Import T2, Import T1, MR-Sel T2, MR-Sel T1]

PB.1 – All operators on PDW
PB.2P – Selections on T1 and T2 pushed to Hadoop; join performed on PDW
PB.2H – Selections & join on Hadoop

Page 22: A Perfect Hybrid

Performance Wrap-up
• Split query processing really works!
• Up to 10X performance improvement!
• A cost-based optimizer is clearly required to decide when an operator should be pushed
• The optimizer must also incorporate relative cluster sizes in its decisions

Page 23: A Perfect Hybrid

Reference
• Split Query Processing in Polybase (SIGMOD '13, June 22-27, 2013, New York, USA), Microsoft Corporation
• Polybase: What, Why, How (slides), Microsoft Corporation
• Query Optimization in Microsoft SQL Server PDW (SIGMOD '12, May 20-24, 2012, Scottsdale, Arizona, USA), Microsoft Corporation

Page 24: A Perfect Hybrid

THANKS