sql and machine learning on hadoop

1 1 Pivotal Confidential–Internal Use Only

SQL & Machine Learning on Hadoop

Mukund Babbar Pivotal Feb, 2015

1986 … 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012 2014

1995 1997 1999 2001 2003 2005 2007 2009 2011 2013 2015

Journey to Apache

Michael Stonebraker develops Postgres at UCB

Postgres adds support for SQL

Open Source PostgreSQL

PostgreSQL 7.0 released

PostgreSQL 8.0 released

Greenplum forks PostgreSQL

Hadoop 1.0 Released

HAWQ & MADlib go Apache

HAWQ launched

Hadoop 2.0 Released

MADlib launched

Greenplum open sourced


Apache HAWQ Overview

4

HAWQ – SQL on Hadoop

5

Shared-Nothing Database Architecture

Standby Master

Segment Host with one or more Segment Instances Segment Instances process queries in parallel

High speed interconnect for continuous pipelining of data processing …

Master Host

SQL Master Host and Standby Master Host Master coordinates work with Segment Hosts

Interconnect

Segment Host Segment Instance Segment Instance Segment Instance Segment Instance

Segment Hosts have their own CPU, disk and memory (shared nothing)


node1


node2


node3


nodeN

6

Key Features of

HAWQ

5

7

5 •  Up to 30x SQL-‐on-‐Hadoop performance advantage

•  Faster ;me to insight •  Massive MPP scalability to petabytes

Benefits: Near real-‐;me latency, complex queries and advanced analy;cs at scale

1. Advanced Analy9cs Performance

Key Features of

HAWQ

8

HAWQ Performance vs Impala

HAWQ Faster

Impala Faster

2 28 46 66 73 76 79 80 88 90 96

HAWQ •  Faster on 46 of 62

TPC-DS queries completed*

•  4.55x mean avg. •  12 hrs faster total

* Impala supported 74 of 99 queries, 12 crashed mid-run

9

HAWQ vs Apache Hive w/Tez

HAWQ Faster

Hive Faster

3 7 15 25 27 34 46 48 76 79 89 90 96

HAWQ •  Faster on 45 of 60

TPC-DS queries completed*

•  3.44x mean avg. •  9 hrs faster total

* Hive supported 65 of 99 queries, 5 crashed mid-run

10

5 • ANSI SQL-‐92, -‐99, -‐2003 • All 99 TPC-‐DS queries tested, no modifica;ons

• Plus, OLAP extensions • Complete ACID integrity and reliability Benefits: 100% SQL compliant No risk to SQL applica;ons All na;ve on HDP via HAWQ

2. 100% ANSI SQL Compliant

Key Features of

HAWQ

11

5 • Advanced machine learning for big data • Local, in-‐database opera;on • Excep;onal MPP/parallel performance • Open source, Postgres-‐based Benefits: Advanced, highly scalable, machine learning, directly on data in Hadoop

3. Integrated Machine Learning

Key Features of

HAWQ

12

5 • HDP, PHD, other ODPi-‐derived distros • Easily managed via Ambari • On premises, in cloud, or PaaS • HBase, Avro, Parquet and more • Connectors to make HAWQ data available to other SQL query tools Benefits: Flexibility Accessibility Portability

4. Flexible Deployment

Key Features of

HAWQ

13

5 • Cost-‐based query op;miza;on • Robust query plan op;miza;on • Complex big data management

Benefits: Op;mize performance and costs Maximize Hadoop cluster resources Offload EDW w/o compromise

5. Query Op9miza9on Op9ons

Key Features of

HAWQ

14

Advanced MPP: Polymorphic Storage™

�  Columnar storage is well suited to scanning a large percentage of the data

�  Row storage excels at small lookups �  Most systems need to do both �  Row and column orienta;on can be

mixed within a table or database

�  Both types can be drama;cally more efficient with compression

�  Compression is definable column by column: �  Blockwise: Gzip1-‐9 & QuickLZ �  Streamwise: Run Length Encoding (RLE) (levels 1-‐4)

�  Flexible indexing, par;;oning enable more granular control and enable true ILM

TABLE ‘SALES’ Mar Apr May Jun Jul Aug Sept Oct Nov

Row-‐oriented for Small Scans Column-‐oriented for Full Scans

15

PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}

•  Allows users to write HAWQ functions in R, Perl, Java, Perl, pgsql or C languages

•  The interpreter/VM of the language ‘X’ is installed on each node of the HAWQ Cluster

•  Data Parallelism: –  PL/X piggybacks on

HAWQ’s MPP architecture

16

Apache HAWQ

●  Discover New Rela9onships ●  Enable Data Science ●  Analyze External Sources ●  Query All Data Types!

Mul9-‐level Fault Tolerance

Granular Authoriza9on

Resource Mgmt (+ YARN)

high mul(-‐tenancy

ANSI SQL Standard

OLAP Extensions

JDBC ODBC Connec9vity

Parallel Processing

Online Expansion

HDFS

Petabyte Scale

Cost Based Op9mizer

Dynamic Pipelining

ACID + Transac9onal

Mul9-‐Language UDF Support

Built-‐in Data Science Library

Extensible (PXF)

Query External Sources

Hardened, 10+ Years Investment, Produc9on Proven

Accessibility + Usability

HDFS Na9ve File Formats

●  Manage Mul9ple Workloads ●  Petabyte Scale Analy9cs ●  Security controls

●  Leverage Exis9ng SQL Skills & BI Tools

●  Easily Integrate with Other Tools

●  Sub-‐second Performance

Compression + Par99oning

core

compliance

●  Hadoop-‐Na9ve ●  Supports Pivotal HD

and Hortonworks Data Pladorm

●  Ambari-‐Integrated

17

Apache HAWQ 2.0 (new features..) Areas of Enhancement New Features

Elas;c & Scalable Architecture

Hadoop-‐Na;ve Integra;ons

Simplified External Data Access/Queries

Performance & Op;miza;ons

On-‐Demand Virtual Segments

Flexible Query Dispatch on subset nodes

3 Tier RM: YARN level>User>Query-‐Operator

Dynamic Cluster Expansion (no redistribute)

New Fault Tolerance Service

HCatalog integra;on -‐ Read Access

HDFS Catalog Cache

Per Table Directory storage (user friendly)

Single physical segment per node

Easier Administra;on/Usage

Cloud-‐Ready Simpler Management Commands

18

HAWQ Segments

HAWQ Masters

Yarn

Physical Segment

Client

Parser/ Analyzer

Op;mizer

Dispatcher

DataNode

NodeManager

NameNode NameNode

External Data Stores via Xtension Framework (Hive/HBase/etc)

Resource Manager

Fault Tolerance Service

Catalog Service

Virtual Segment

Virtual Segment

Physical Segment

DataNode

NodeManager

Virtual Segment

Virtual Segment

Physical Segment

DataNode

NodeManager

Virtual Segment

Virtual Segment

Resource Broker

libYARN

HDFS Catalog Cache

Interconnect Interconnect

Apache HAWQ 2.0 Architecture


Apache MADlib Overview

20

Scalable, In-Database Machine Learning

•  Open Source https://github.com/apache/incubator-madlib •  Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL •  Downloads and Docs: http://madlib.incubator.apache.org/

Apache (incubating)

21

Functions Predictive Modeling Library

Linear Systems •  Sparse and Dense Solvers •  Linear Algebra

Matrix Factorization •  Singular Value Decomposition (SVD) •  Low Rank

Generalized Linear Models •  Linear Regression •  Logistic Regression •  Multinomial Logistic Regression •  Cox Proportional Hazards Regression •  Elastic Net Regularization •  Robust Variance (Huber-White), Clustered

Variance, Marginal Effects

Other Machine Learning Algorithms •  Principal Component Analysis (PCA) •  Association Rules (Apriori) •  Topic Modeling (Parallel LDA) •  Decision Trees •  Random Forest •  Support Vector Machines •  Conditional Random Field (CRF) •  Clustering (K-means) •  Cross Validation •  Naïve Bayes •  Support Vector Machines (SVM)

Descriptive Statistics

Sketch-Based Estimators •  CountMin (Cormode-Muth.) •  FM (Flajolet-Martin) •  MFV (Most Frequent Values) Correlation Summary

Support Modules

Array Operations Sparse Vectors Random Sampling Probability Functions Data Preparation PMML Export Conjugate Gradient

Inferential Statistics

Hypothesis Tests

Time Series •  ARIMA

Oct 2014

22

MADlib Advantages

�  Better parallelism –  Algorithms designed to leverage MPP and

Hadoop architecture

�  Better scalability –  Algorithms scale as your data set scales

�  Better predictive accuracy –  Can use all data, not a sample

�  ASF open source (incubating) –  Available for customization and optimization

23

Calling MADlib Functions: Fast Training & Scoring •  MADlib allows users to easily create

models without moving data out of the systems

–  Model generation –  Model validation –  Scoring (evaluation of) new data

•  All the data can be used in one model •  Built-in functionality to create multiple

smaller models (e.g. classification grouped by feature)

•  Open source lets you tweak and extend methods, or build your own

24

Challenges in computing OLS solution

a b c d e f g h

X

Segment 1

Segment 2

25


a b c d e f g h

X

Segment 1

Segment 2

a c e g b d f h

Segm

ent 1

Segm

ent 2

XT

26


a b c d e f g h

X

a c e g b d f h

XT

a2+c2+e2+g2 =

Data across nodes are multiplied

27


a b c d e f g h

X

a c e g b d f h

XT

a2+c2+e2+g2 =

Looks like the result can be decomposed

ab+cd+ef+gh b2+d2+f2+h2

ab+cd+ef+gh

28


a b c d e f g h

X

a c e g b d f h

XT

a2+c2+e2+g2 =

Data across nodes are multiplied!

ab+cd+ef+gh b2+d2+f2+h2

ab+cd+ef+gh

= + a b e f

e f a b + c d g

h g h c

d +

29

Linear Regression on 10 Million Rows in Seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

30

Contributors Welcome!

•  Web sites –  http://hawq.incubator.apache.org/ –  http://madlib.incubator.apache.org/ –  https://cran.r-project.org/web/packages/PivotalR/index.html

•  Github –  https://github.com/apache/incubator-hawq –  https://github.com/apache/incubator-madlib –  https://github.com/pivotalsoftware/PivotalR

sql and machine learning on hadoop

Technology