SQL and Machine Learning on Hadoop

1 Pivotal Confidential–Internal Use Only SQL & Machine Learning on Hadoop Mukund Babbar Pivotal Feb, 2015


Post on 08-Feb-2017

TRANSCRIPT

Page 1: SQL and Machine Learning on Hadoop


SQL & Machine Learning on Hadoop

Mukund Babbar Pivotal Feb, 2015

Page 2: SQL and Machine Learning on Hadoop

Journey to Apache

[Timeline figure spanning 1986 to 2015, with the following event labels]

• Michael Stonebraker develops Postgres at UCB
• Postgres adds support for SQL
• Open Source PostgreSQL
• PostgreSQL 7.0 released
• PostgreSQL 8.0 released
• Greenplum forks PostgreSQL
• Hadoop 1.0 Released
• HAWQ & MADlib go Apache
• HAWQ launched
• Hadoop 2.0 Released
• MADlib launched
• Greenplum open sourced

Page 3: SQL and Machine Learning on Hadoop


Apache HAWQ Overview

Page 4: SQL and Machine Learning on Hadoop


HAWQ – SQL on Hadoop

Page 5: SQL and Machine Learning on Hadoop

Shared-Nothing Database Architecture

[Architecture diagram: Master Host and Standby Master, connected through the Interconnect to Segment Hosts node1, node2, node3 … nodeN, each running several Segment Instances]

• Master Host and Standby Master Host: the master accepts SQL and coordinates work with the Segment Hosts
• Segment Hosts: each runs one or more Segment Instances, which process queries in parallel
• Segment Hosts have their own CPU, disk and memory (shared nothing)
• Interconnect: high-speed interconnect for continuous pipelining of data processing
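To make the shared-nothing flow concrete, here is a minimal Python sketch. It is not HAWQ code: threads stand in for segment instances, and the names `segment_sum_count`, `master_average` and the sample data are hypothetical. Each segment aggregates only the rows it stores, and the master merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def segment_sum_count(rows):
    """Work done by one segment instance: aggregate only its local rows."""
    return sum(rows), len(rows)

def master_average(segments):
    """Master: dispatch to every segment in parallel, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        partials = list(pool.map(segment_sum_count, segments))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count

# Hypothetical data: each inner list is the slice stored on one segment host.
segments = [[10, 20, 30], [40, 50], [60]]
average = master_average(segments)
print(average)  # 35.0
```

No segment ever reads another segment's rows; only the small (sum, count) pairs travel back to the master.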

Page 6: SQL and Machine Learning on Hadoop

5 Key Features of HAWQ

Page 7: SQL and Machine Learning on Hadoop

Key Features of HAWQ

1. Advanced Analytics Performance

• Up to 30x SQL-on-Hadoop performance advantage
• Faster time to insight
• Massive MPP scalability to petabytes

Benefits: Near real-time latency, complex queries and advanced analytics at scale

Page 8: SQL and Machine Learning on Hadoop

HAWQ Performance vs Impala

[Chart: per-query comparison with regions labeled "HAWQ Faster" and "Impala Faster"; axis ticks 2, 28, 46, 66, 73, 76, 79, 80, 88, 90, 96]

HAWQ
• Faster on 46 of 62 TPC-DS queries completed*
• 4.55x faster on average (mean)
• 12 hrs faster total

* Impala supported 74 of 99 queries, 12 crashed mid-run

Page 9: SQL and Machine Learning on Hadoop

HAWQ vs Apache Hive w/Tez

[Chart: per-query comparison with regions labeled "HAWQ Faster" and "Hive Faster"; axis ticks 3, 7, 15, 25, 27, 34, 46, 48, 76, 79, 89, 90, 96]

HAWQ
• Faster on 45 of 60 TPC-DS queries completed*
• 3.44x faster on average (mean)
• 9 hrs faster total

* Hive supported 65 of 99 queries, 5 crashed mid-run
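The footnotes appear to explain the denominators on the two benchmark slides: queries completed equals queries supported minus queries that crashed mid-run. A small Python check, using only figures quoted on the slides (the variable and function names are illustrative):

```python
def completed(supported, crashed):
    """Queries that ran to completion out of those the engine could parse."""
    return supported - crashed

impala_completed = completed(74, 12)  # Impala: 74 supported, 12 crashed
hive_completed = completed(65, 5)     # Hive:   65 supported, 5 crashed

# Share of completed queries on which HAWQ was the faster engine.
hawq_share_vs_impala = 46 / impala_completed
hawq_share_vs_hive = 45 / hive_completed

print(impala_completed, hive_completed)                        # 62 60
print(f"{hawq_share_vs_impala:.0%} {hawq_share_vs_hive:.0%}")  # 74% 75%
```

So the comparisons cover 62 and 60 of the 99 TPC-DS queries respectively, not the full suite.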

Page 10: SQL and Machine Learning on Hadoop

Key Features of HAWQ

2. 100% ANSI SQL Compliant

• ANSI SQL-92, -99, -2003
• All 99 TPC-DS queries tested, no modifications
• Plus, OLAP extensions
• Complete ACID integrity and reliability

Benefits: 100% SQL compliant; no risk to SQL applications; all native on HDP via HAWQ

Page 11: SQL and Machine Learning on Hadoop

Key Features of HAWQ

3. Integrated Machine Learning

• Advanced machine learning for big data
• Local, in-database operation
• Exceptional MPP/parallel performance
• Open source, Postgres-based

Benefits: Advanced, highly scalable machine learning, directly on data in Hadoop

Page 12: SQL and Machine Learning on Hadoop

Key Features of HAWQ

4. Flexible Deployment

• HDP, PHD, other ODPi-derived distros
• Easily managed via Ambari
• On premises, in cloud, or PaaS
• HBase, Avro, Parquet and more
• Connectors to make HAWQ data available to other SQL query tools

Benefits: Flexibility, accessibility, portability

Page 13: SQL and Machine Learning on Hadoop

Key Features of HAWQ

5. Query Optimization Options

• Cost-based query optimization
• Robust query plan optimization
• Complex big data management

Benefits: Optimize performance and costs; maximize Hadoop cluster resources; offload EDW w/o compromise

Page 14: SQL and Machine Learning on Hadoop

Advanced MPP: Polymorphic Storage™

• Columnar storage is well suited to scanning a large percentage of the data
• Row storage excels at small lookups
• Most systems need to do both
• Row and column orientation can be mixed within a table or database
• Both types can be dramatically more efficient with compression
• Compression is definable column by column:
  – Blockwise: Gzip (levels 1-9) & QuickLZ
  – Streamwise: Run Length Encoding (RLE) (levels 1-4)
• Flexible indexing and partitioning enable more granular control and true ILM

[Figure: TABLE 'SALES' partitioned by month (Mar through Nov); row-oriented for small scans, column-oriented for full scans]
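The trade-off between the two orientations, and why run-length encoding suits columns, can be sketched in a few lines of Python. This is an illustration only, not HAWQ's storage format; the SALES rows and the helper names `rle_encode` and `rle_decode` are hypothetical.

```python
from itertools import groupby

# Hypothetical SALES rows: (month, region, amount)
rows = [("Mar", "east", 10), ("Mar", "east", 12), ("Mar", "west", 7), ("Apr", "west", 9)]

# Row orientation: one record per entry, convenient for fetching a whole row.
row_store = rows

# Column orientation: one list per column, convenient for scanning one column.
column_store = {
    "month": [r[0] for r in rows],
    "region": [r[1] for r in rows],
    "amount": [r[2] for r in rows],
}

def rle_encode(values):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in pairs for _ in range(length)]

encoded_month = rle_encode(column_store["month"])
print(encoded_month)                 # [('Mar', 3), ('Apr', 1)]
print(sum(column_store["amount"]))   # 38: a full scan touches one column only
print(row_store[2])                  # ('Mar', 'west', 7): a small lookup reads one row
```

Repeated values sit next to each other in a column, which is why the encoded `month` column collapses to two pairs.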

Page 15: SQL and Machine Learning on Hadoop

PL/X : X in {pgsql, R, Python, Java, Perl, C, etc.}

• Allows users to write HAWQ functions in R, Python, Java, Perl, pgsql or C
• The interpreter/VM of the language 'X' is installed on each node of the HAWQ cluster
• Data Parallelism:
  – PL/X piggybacks on HAWQ's MPP architecture
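A rough Python sketch of the data-parallel idea: the user writes a function once, and the interpreter on each node applies it to that node's rows. This is plain Python, not PL/Python syntax, and `celsius_to_fahrenheit`, `run_udf_on_segment` and the data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def celsius_to_fahrenheit(celsius):
    """Hypothetical user-defined function, written once by the user."""
    return celsius * 9 / 5 + 32

def run_udf_on_segment(rows):
    """Each segment's local interpreter applies the UDF to its own rows only."""
    return [celsius_to_fahrenheit(value) for value in rows]

# Hypothetical column values, already distributed across three segments.
segments = [[0, 100], [50], [-40]]

with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_udf_on_segment, segments))

print(results)  # [[32.0, 212.0], [122.0], [-40.0]]
```

The function contains no parallel logic of its own; the parallelism comes entirely from how the data is already spread across segments.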

Page 16: SQL and Machine Learning on Hadoop

Apache HAWQ

Hardened, 10+ Years Investment, Production Proven

● Discover New Relationships
● Enable Data Science
● Analyze External Sources
● Query All Data Types!
● Manage Multiple Workloads
● Petabyte Scale Analytics
● Security controls
● Leverage Existing SQL Skills & BI Tools
● Easily Integrate with Other Tools
● Sub-second Performance
● Hadoop-Native
● Supports Pivotal HD and Hortonworks Data Platform
● Ambari-Integrated

[Diagram blocks, grouped under the labels "core", "compliance" and "high multi-tenancy"]

Multi-level Fault Tolerance
Granular Authorization
Resource Mgmt (+ YARN)
ANSI SQL Standard
OLAP Extensions
JDBC ODBC Connectivity
Parallel Processing
Online Expansion
HDFS
Petabyte Scale
Cost Based Optimizer
Dynamic Pipelining
ACID + Transactional
Multi-Language UDF Support
Built-in Data Science Library
Extensible (PXF)
Query External Sources
Accessibility + Usability
HDFS Native File Formats
Compression + Partitioning

Page 17: SQL and Machine Learning on Hadoop

Apache HAWQ 2.0 (new features)

Areas of Enhancement:
• Elastic & Scalable Architecture
• Hadoop-Native Integrations
• Simplified External Data Access/Queries
• Performance & Optimizations
• Easier Administration/Usage

New Features:
• On-Demand Virtual Segments
• Flexible Query Dispatch on subset nodes
• 3 Tier RM: YARN level > User > Query-Operator
• Dynamic Cluster Expansion (no redistribute)
• New Fault Tolerance Service
• HCatalog integration - Read Access
• HDFS Catalog Cache
• Per Table Directory storage (user friendly)
• Single physical segment per node
• Cloud-Ready Simpler Management Commands

Page 18: SQL and Machine Learning on Hadoop

Apache HAWQ 2.0 Architecture

[Architecture diagram with the following components]

• Client
• HAWQ Masters: Parser/Analyzer, Optimizer, Dispatcher, Resource Manager, Fault Tolerance Service, Catalog Service, HDFS Catalog Cache, Resource Broker (libYARN)
• YARN
• NameNode
• HAWQ Segments: each node runs a DataNode, a NodeManager and a Physical Segment hosting Virtual Segments
• Interconnect
• External Data Stores via Xtension Framework (Hive/HBase/etc)

Page 19: SQL and Machine Learning on Hadoop


Apache MADlib Overview

Page 20: SQL and Machine Learning on Hadoop

Scalable, In-Database Machine Learning

Apache MADlib (incubating)

• Open Source: https://github.com/apache/incubator-madlib
• Supports Greenplum DB, Apache HAWQ/HDB and PostgreSQL
• Downloads and Docs: http://madlib.incubator.apache.org/

Page 21: SQL and Machine Learning on Hadoop

Functions (Oct 2014)

Predictive Modeling Library

Linear Systems
• Sparse and Dense Solvers
• Linear Algebra

Matrix Factorization
• Singular Value Decomposition (SVD)
• Low Rank

Generalized Linear Models
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Robust Variance (Huber-White), Clustered Variance, Marginal Effects

Other Machine Learning Algorithms
• Principal Component Analysis (PCA)
• Association Rules (Apriori)
• Topic Modeling (Parallel LDA)
• Decision Trees
• Random Forest
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Clustering (K-means)
• Cross Validation
• Naïve Bayes

Time Series
• ARIMA

Descriptive Statistics

Sketch-Based Estimators
• CountMin (Cormode-Muthukrishnan)
• FM (Flajolet-Martin)
• MFV (Most Frequent Values)

Correlation

Summary

Inferential Statistics

Hypothesis Tests

Support Modules

• Array Operations
• Sparse Vectors
• Random Sampling
• Probability Functions
• Data Preparation
• PMML Export
• Conjugate Gradient

Page 22: SQL and Machine Learning on Hadoop

MADlib Advantages

• Better parallelism
  – Algorithms designed to leverage MPP and Hadoop architecture
• Better scalability
  – Algorithms scale as your data set scales
• Better predictive accuracy
  – Can use all data, not a sample
• ASF open source (incubating)
  – Available for customization and optimization

Page 23: SQL and Machine Learning on Hadoop

Calling MADlib Functions: Fast Training & Scoring

• MADlib allows users to easily create models without moving data out of the systems
  – Model generation
  – Model validation
  – Scoring (evaluation of) new data
• All the data can be used in one model
• Built-in functionality to create multiple smaller models (e.g. classification grouped by feature)
• Open source lets you tweak and extend methods, or build your own
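The train-then-score workflow, including the "multiple smaller models" case, can be illustrated with a toy Python example. This is not MADlib's API (MADlib functions are called from SQL inside the database); `fit_line`, `fit_per_group`, `score` and the rows are hypothetical stand-ins that fit one single-feature least-squares line per group.

```python
from collections import defaultdict

def fit_line(points):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def fit_per_group(rows):
    """Model generation: train one small model per grouping value."""
    groups = defaultdict(list)
    for group, x, y in rows:
        groups[group].append((x, y))
    return {group: fit_line(points) for group, points in groups.items()}

def score(model, x):
    """Scoring: evaluate a fitted model on a new input."""
    intercept, slope = model
    return intercept + slope * x

# Hypothetical training rows: (group, x, y)
rows = [("a", 0, 1), ("a", 1, 3), ("a", 2, 5), ("b", 0, 0), ("b", 2, 1), ("b", 4, 2)]
models = fit_per_group(rows)
print(score(models["a"], 3))  # 7.0
print(score(models["b"], 6))  # 3.0
```

In the in-database setting the grouping, training and scoring all happen where the rows already live, which is the point the slide makes about not moving data out.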

Page 24: SQL and Machine Learning on Hadoop

Challenges in computing OLS solution

\[
X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}
\]

The rows of X are distributed across Segment 1 and Segment 2.

Page 25: SQL and Machine Learning on Hadoop

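Challenges in computing OLS solution

\[
X = \begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix},
\qquad
X^T = \begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
\]

The rows of X, and therefore the columns of X^T, are distributed across Segment 1 and Segment 2.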

Page 26: SQL and Machine Learning on Hadoop

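Challenges in computing OLS solution

\[
X^T X =
\begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}
=
\begin{pmatrix} a^2 + c^2 + e^2 + g^2 & \cdots \\ \cdots & \cdots \end{pmatrix}
\]

Data across nodes are multiplied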

Page 27: SQL and Machine Learning on Hadoop

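Challenges in computing OLS solution

\[
X^T X =
\begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}
=
\begin{pmatrix}
a^2 + c^2 + e^2 + g^2 & ab + cd + ef + gh \\
ab + cd + ef + gh & b^2 + d^2 + f^2 + h^2
\end{pmatrix}
\]

Looks like the result can be decomposed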

Page 28: SQL and Machine Learning on Hadoop

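Challenges in computing OLS solution

\[
X^T X =
\begin{pmatrix} a & c & e & g \\ b & d & f & h \end{pmatrix}
\begin{pmatrix} a & b \\ c & d \\ e & f \\ g & h \end{pmatrix}
=
\begin{pmatrix}
a^2 + c^2 + e^2 + g^2 & ab + cd + ef + gh \\
ab + cd + ef + gh & b^2 + d^2 + f^2 + h^2
\end{pmatrix}
\]

\[
X^T X =
\begin{pmatrix} a \\ b \end{pmatrix} \begin{pmatrix} a & b \end{pmatrix}
+ \begin{pmatrix} c \\ d \end{pmatrix} \begin{pmatrix} c & d \end{pmatrix}
+ \begin{pmatrix} e \\ f \end{pmatrix} \begin{pmatrix} e & f \end{pmatrix}
+ \begin{pmatrix} g \\ h \end{pmatrix} \begin{pmatrix} g & h \end{pmatrix}
\]

Data across nodes are multiplied!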
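The decomposition means each segment can compute the sum of outer products of its own rows, and only the small k-by-k partial results need to be added together. A minimal Python sketch under stated assumptions: the numeric values standing in for a through h are made up, the first two rows are assumed to live on Segment 1 and the last two on Segment 2, and `outer_sum` and `add` are hypothetical helper names.

```python
def outer_sum(rows):
    """Sum over rows of (row^T row): this segment's contribution to X^T X."""
    k = len(rows[0])
    acc = [[0] * k for _ in range(k)]
    for row in rows:
        for i in range(k):
            for j in range(k):
                acc[i][j] += row[i] * row[j]
    return acc

def add(m1, m2):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(m1, m2)]

# Assumed placement and made-up values for a..h.
segment_1 = [[1, 2], [3, 4]]  # rows (a, b) and (c, d)
segment_2 = [[5, 6], [7, 8]]  # rows (e, f) and (g, h)

# Each segment works on local rows only; the master adds two 2x2 matrices.
combined = add(outer_sum(segment_1), outer_sum(segment_2))

# Reference: the same product computed over all rows in one place.
direct = outer_sum(segment_1 + segment_2)

print(combined)  # [[84, 100], [100, 120]]
assert combined == direct
```

The per-segment results are k-by-k regardless of how many rows a segment holds, so the amount of data exchanged between nodes does not grow with the table.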

Page 29: SQL and Machine Learning on Hadoop


Linear Regression on 10 Million Rows in Seconds

Hellerstein, Joseph M., et al. "The MADlib analytics library: or MAD skills, the SQL." Proceedings of the VLDB Endowment 5.12 (2012): 1700-1711.

Page 30: SQL and Machine Learning on Hadoop

Contributors Welcome!

• Web sites
  – http://hawq.incubator.apache.org/
  – http://madlib.incubator.apache.org/
  – https://cran.r-project.org/web/packages/PivotalR/index.html
• GitHub
  – https://github.com/apache/incubator-hawq
  – https://github.com/apache/incubator-madlib
  – https://github.com/pivotalsoftware/PivotalR

Page 31: SQL and Machine Learning on Hadoop


?