A Perfect Hybrid: Split Query Processing in Polybase
biaobiaoqi, [email protected]
2013/4/25


Page 1: A Perfect Hybrid

A Perfect Hybrid

Split query processing in Polybase

[email protected]

2013/4/25

Page 2: A Perfect Hybrid

Outline
• Background
• Related Work
• PDW
• Polybase
• Performance Evaluation

Page 3: A Perfect Hybrid

Background
• Structured data & unstructured data
• RDBMS & Big Data

[Diagram: combining RDBMS and Hadoop to gain insight]

Page 4: A Perfect Hybrid

Related Work
• Sqoop: transferring bulk data between Hadoop and structured data stores such as relational databases
• Teradata & Aster Data
• Greenplum & Vertica: external tables
• Oracle: external tables and OLH (Oracle Loader for Hadoop)
• IBM: split mechanism using MapReduce to access the appliance
• Hadapt (HadoopDB): designed from the outset to support the execution of SQL-like queries across both unstructured and structured data sets

Page 5: A Perfect Hybrid

PDW Architecture
• Parallel Data Warehouse

• Shared-nothing system

Page 6: A Perfect Hybrid

Components in PDW
• Node
  o A SQL Server instance runs on each node
  o Data are hash-partitioned across the compute nodes
• Control node [hosts the PDW Engine]
  o Query parsing
  o Optimization
  o Creating the distributed execution plan (DSQL) for the compute nodes
  o Tracking execution of the plan steps on the compute nodes
• Compute node
  o Storage
  o Query processing
• DMS: Data Movement Service
  o (1) Repartitioning rows of a table among the SQL Server instances on the PDW compute nodes
  o (2) Converting fields of rows being loaded into the appliance into the appropriate ODBC types
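
The hash-partitioning that DMS performs can be sketched as follows. This is a minimal illustration in Python, not PDW's actual implementation; the function and column names are hypothetical.

```python
# Minimal sketch (hypothetical helper names) of hash-partitioning rows of a
# table across the SQL Server instances on PDW compute nodes: each row's
# distribution key is hashed to pick a deterministic target node.

def partition_rows(rows, key, num_nodes):
    """Assign each row to a compute node by hashing its distribution key."""
    partitions = {n: [] for n in range(num_nodes)}
    for row in rows:
        node = hash(row[key]) % num_nodes  # deterministic target node
        partitions[node].append(row)
    return partitions

rows = [{"c_custkey": k, "c_name": f"Customer#{k}"} for k in range(8)]
parts = partition_rows(rows, "c_custkey", num_nodes=4)
assert sum(len(p) for p in parts.values()) == len(rows)  # every row lands somewhere
```

Because the assignment depends only on the key value, all rows with equal keys land on the same node, which is what lets joins on the distribution key run locally.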

Page 7: A Perfect Hybrid

Overview of Polybase
• A new feature in PDW V2
• Uses standard SQL
• Deals with both structured and unstructured data (in SQL Server and Hadoop)
• Split query processing paradigm
• Leverages the capabilities of SQL Server PDW, especially its cost-based parallel query optimizer and execution engine

Page 8: A Perfect Hybrid

Use case of Polybase

Page 9: A Perfect Hybrid

Assumption in Polybase

• 1. Polybase makes no assumptions about where the HDFS data resides

• 2. Nor any assumptions about the OS of the data nodes

• 3. Nor about the format of the HDFS files (e.g. TextFile, RCFile, custom, …)

Page 10: A Perfect Hybrid

Core Components
• External Table
• HDFS Bridge in DMS
• Cost-based query optimizer (wrapping the one in V1)

Page 11: A Perfect Hybrid

External Table
• Create cluster instance
  o CREATE HADOOP_CLUSTER GSL_CLUSTER
    WITH (namenode='hadoop-head', namenode_port=9000,
          jobtracker='hadoop-head', jobtracker_port=9010);
• Create file format
  o CREATE HADOOP_FILEFORMAT TEXT_FORMAT
    WITH (INPUT_FORMAT='polybase.TextInputFormat',
          OUTPUT_FORMAT='polybase.TextOutputFormat',
          ROW_DELIMITER='\n', COLUMN_DELIMITER='|');
• Create external table
  o CREATE EXTERNAL TABLE hdfsCustomer
    ( c_custkey bigint not null,
      c_name varchar(25) not null,
      ……
      c_comment varchar(117) not null )
    WITH (LOCATION='/tpch1gb/customer.tbl',
          FORMAT_OPTIONS (EXTERNAL_CLUSTER = GSL_CLUSTER,
                          EXTERNAL_FILEFORMAT = TEXT_FORMAT));

Page 12: A Perfect Hybrid

HDFS Bridge

Page 13: A Perfect Hybrid

HDFS Bridge
• The HDFS Bridge is a component of DMS
• Goal: transfer data in parallel between the nodes of the Hadoop and PDW clusters
• HDFS shuffle phase (reading data from Hadoop):
  o 1. Communicate with the namenode to get information about the file
  o 2. Balance the number of bytes read by each DMS instance (based on the HDFS info and the DMS instance count)
  o 3. Invoke openRecordReader(); the RecordReader instance communicates directly with the datanodes
  o 4. Read the data and convert it into ODBC types (may be done in a MapReduce job)
  o 5. Apply a hash function to determine the target node for each record
• Writing to Hadoop is almost the same, invoking openRecordWriter()
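
The load-balancing step above (each DMS instance reading a near-equal share of the file) can be sketched as follows. This is an illustrative Python sketch with hypothetical names; the real bridge splits on HDFS block boundaries rather than raw byte counts.

```python
# Rough sketch of the balancing step of the HDFS shuffle phase: split a
# file's bytes into contiguous ranges so each DMS instance reads a roughly
# equal share. File size and instance count are illustrative values.

def assign_splits(file_size, num_instances):
    """Give each DMS instance a contiguous, near-equal (offset, length) range."""
    base, extra = divmod(file_size, num_instances)
    splits, offset = [], 0
    for i in range(num_instances):
        length = base + (1 if i < extra else 0)  # spread the remainder
        splits.append((offset, length))
        offset += length
    return splits

splits = assign_splits(file_size=1000, num_instances=3)
# The ranges are contiguous and together cover the whole file.
assert sum(length for _, length in splits) == 1000
```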

Page 14: A Perfect Hybrid

Read Process

Page 15: A Perfect Hybrid

Optimizer & Compilation

• Parsing
  o Builds a Memo data structure of alternative serial plans

• Parallel optimization [in PDW V1]
  o A bottom-up optimizer inserts data movement operators into the serial plans

• Cost-based query optimizer [decides whether to push work to Hadoop]
  o Based on statistics, the relative sizes of the two clusters, and other factors

• Semantic compatibility
  o Data types
  o SQL semantics
  o Error handling

Page 16: A Perfect Hybrid

Statistics
• Define statistics on an external table:
  o CREATE STATISTICS hdfsCustomerStats ON hdfsCustomer (c_custkey);

• Steps to obtain statistics on HDFS data:
  o 1. Read block-level sample data via DMS or map jobs
  o 2. Partition the samples across the compute nodes
  o 3. Each node calculates a histogram on its portion
  o 4. Merge all the histograms and store the result in the database catalog

• An alternative implementation:
  o In Hadoop V2, let the Hadoop cluster calculate the histograms (costly)
  o Makes the best use of the Hadoop cluster's computational resources
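
Steps 3-4 above (per-node histograms merged into one catalog histogram) can be sketched like this. The equi-width bucket layout is an assumption for illustration; PDW's actual histogram format differs.

```python
# Hedged sketch of statistics collection: each compute node builds an
# equi-width histogram over its sample, then the per-node histograms are
# merged bucket-by-bucket into one global histogram for the catalog.

def local_histogram(values, lo, hi, buckets):
    """Count sampled values into `buckets` equal-width ranges over [lo, hi)."""
    counts = [0] * buckets
    width = (hi - lo) / buckets
    for v in values:
        idx = min(int((v - lo) / width), buckets - 1)  # clamp the top edge
        counts[idx] += 1
    return counts

def merge_histograms(hists):
    """Merge same-shaped histograms by summing corresponding buckets."""
    return [sum(col) for col in zip(*hists)]

node_samples = [[1, 5, 9], [2, 2, 8], [4, 6, 7]]  # one sample list per node
hists = [local_histogram(s, lo=0, hi=10, buckets=2) for s in node_samples]
merged = merge_histograms(hists)
assert sum(merged) == 9  # the merged histogram covers every sampled value
```

Summing buckets is exact for equi-width histograms, which is why the expensive part is sampling, not merging.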

Page 17: A Perfect Hybrid

Semantic Compatibility

• Data types
  o Java primitive types
  o Non-primitive types
  o Third-party types that can be implemented
  o Types that cannot be implemented in Java are marked [can only be processed in PDW]

• SQL semantics
  o Expressions are reimplemented in Java
  o Returning null: e.g. A+B becomes (A==null || B==null) ? null : (A+B)
  o Expressions that cannot be implemented in Java are marked [can only be processed in PDW]

• Error handling
  o Exceptions that would be raised in SQL should also be thrown in Java
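
The null-propagation rewrite above can be illustrated in a few lines. This is a Python sketch of the SQL semantics the slide describes, with None standing in for NULL; the function name is hypothetical.

```python
# Illustration of the SQL null-propagation rule: A + B must yield NULL when
# either operand is NULL, matching the Java rewrite
# (A == null || B == null) ? null : (A + B). None stands in for NULL.

def sql_add(a, b):
    """Add two values with SQL semantics: NULL in, NULL out."""
    if a is None or b is None:
        return None
    return a + b

assert sql_add(3, 4) == 7
assert sql_add(None, 4) is None
assert sql_add(3, None) is None
```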

Page 18: A Perfect Hybrid

Example
• SELECT count(*)
  FROM Customer
  WHERE acctbal < 0
  GROUP BY nationkey

Page 19: A Perfect Hybrid

Optimized Query Plan #1

Page 20: A Perfect Hybrid

Optimized Query Plan #2

Page 21: A Perfect Hybrid

MapReduce Join
• Distributed hash join
  o Supports equi-joins

• Implementation:
  o Build side: the side with the smaller amount of data; it is materialized in HDFS
  o Probe side: the other side of the join
  o The build side is partitioned so that each partition fits in memory, speeding up the join
  o The build side may also be replicated
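
The build/probe structure above can be sketched as a toy partitioned hash join. This is a single-process Python illustration under assumed table shapes, not the MapReduce implementation itself; the partition count and column names are hypothetical.

```python
# Toy partitioned hash join in the style the slide describes: the smaller
# (build) side is loaded into an in-memory hash table one partition at a
# time, and the larger (probe) side streams against it.

def hash_join(build_rows, probe_rows, build_key, probe_key, partitions=4):
    out = []
    for p in range(partitions):
        # Build an in-memory hash table for this partition of the small side.
        table = {}
        for row in build_rows:
            if hash(row[build_key]) % partitions == p:
                table.setdefault(row[build_key], []).append(row)
        # Stream the probe side against this partition's hash table.
        for row in probe_rows:
            if hash(row[probe_key]) % partitions == p:
                for match in table.get(row[probe_key], []):
                    out.append({**match, **row})
    return out

t2 = [{"u2": k, "name": f"n{k}"} for k in range(5)]     # build side (smaller)
t1 = [{"u1": k % 5, "val": k * 10} for k in range(10)]  # probe side (larger)
joined = hash_join(t2, t1, build_key="u2", probe_key="u1")
assert len(joined) == 10  # every probe row matches exactly one build row
```

Partitioning by the join key guarantees matching rows land in the same partition, so only one partition's hash table ever needs to fit in memory at once.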

Page 22: A Perfect Hybrid

Performance Evaluation

• Test configurations:
  o C-16/48: 16-node PDW cluster, 48-node Hadoop cluster
  o C-30/30: 30-node PDW cluster, 30-node Hadoop cluster
  o C-60: 60-node PDW cluster and 60-node Hadoop cluster

• Test database:
  o Two identical tables, T1 and T2
    • 10 billion rows
    • 13 integer attributes and 3 string attributes (~200 bytes/row)
    • About 2 TB uncompressed
  o One copy of each table in HDFS
    • HDFS block size of 256 MB
    • Stored as a compressed RCFile
    • RCFiles store rows "column-wise" inside a block
  o One copy of each table in PDW
    • Block-wise compression enabled

Page 23: A Perfect Hybrid

Selection on HDFS Table

SELECT u1, u2, u3, str1, str2, str4 FROM T1 (in HDFS) WHERE (u1 % 100) < SF

[Chart: execution time (secs.) vs. selectivity factor (%), comparing Polybase Phase 1 and Polybase Phase 2, with time broken down into PDW, Import, and MR components]

Crossover point: above a selectivity factor of ~80%, Polybase Phase 2 is slower.

Page 24: A Perfect Hybrid

Join HDFS Table with PDW Table

SELECT * FROM T1 (HDFS), T2 (PDW) WHERE T1.u1 = T2.u2 AND (T1.u2 % 100) < sf AND (T2.u2 % 100) < 50

[Chart: execution time (secs.) vs. selectivity factor (%), comparing Polybase Phase 1 and Polybase Phase 2, with time broken down into PDW, Import, and MR components]

Page 25: A Perfect Hybrid

Join Two HDFS Tables

SELECT * FROM T1 (HDFS), T2 (HDFS) WHERE T1.u1 = T2.u2 AND (T1.u2 % 100) < SF AND (T2.u2 % 100) < 10

[Chart: execution time (secs.) vs. selectivity factor, comparing PB.1, PB.2P, and PB.2H, with time broken down into PDW, Import-Join, MR-Shuffle-Join, MR-Shuffle, Import T2, Import T1, MR-Sel T2, and MR-Sel T1 components]

PB.1 – all operators on PDW
PB.2P – selections on T1 and T2 pushed to Hadoop; join performed on PDW
PB.2H – selections & join on Hadoop

Page 26: A Perfect Hybrid

Performance Wrap-up
• Split query processing really works!
• Up to 10X performance improvement!
• A cost-based optimizer is clearly required to decide when an operator should be pushed
• The optimizer must also incorporate the relative cluster sizes in its decisions

Page 27: A Perfect Hybrid

Reference
• Split Query Processing in Polybase (SIGMOD '13, June 22-27, 2013, New York, USA), Microsoft Corporation
• Polybase: What, Why, How (slides), Microsoft Corporation
• Query Optimization in Microsoft SQL Server PDW (SIGMOD '12, May 20-24, 2012, Scottsdale, Arizona, USA), Microsoft Corporation

Page 28: A Perfect Hybrid

THANKS