BIGSQL homerun or merely a major bluff?
Copyright © 2016 ITGAIN GmbH 1
PER STRICKER, THOMAS KALB
07.02.2017, HEART OF TEXAS DB2 USER GROUP, AUSTIN
08.02.2017, DB2 FORUM USER GROUP, DALLAS
INITIAL EVALUATION BIGSQL FOR HORTONWORKS (Homerun or merely a major bluff?)
Agenda
• Introduction
• The MPP Architecture
• DB2 DPF and Hadoop (HDFS)
• Installation stumbling blocks
   a. Red Hat or CentOS
   b. The HDP Installation with Ambari (See Appendix)
   c. The BigSQL Installation
• Working with BigSQL: the Familiar and the New
   a. DB2 Interface
   b. HDFS Interface
• The Big Data Deployment (SQL for unstructured Data)
• DB2 Engine versus HDFS Engine
   a. Functional Differences
   b. Performance Differences
• Big SQL and Hive
• Conclusion – Sham or Masterstroke?
• Questions and Discussion
Hadoop (HDFS)
http://bradhedlund.s3.amazonaws.com/2011/hadoop-network-intro/Hadoop-Cluster.PNG
Hadoop Distribution
Cloudera / Hortonworks / MapR / IOP (worldwide market share)
Cloudera 53 %
Hortonworks 16 %
MapR 11 %
others 20 %
Source: https://www.dezyre.com/article/top-6-hadoop-vendors-providing-big-data-solutions-in-open-data-platform/93
Hadoop Appraisal
Source: https://www.cloudera.com/content/dam/www/static/documents/analyst-reports/forrester-wave-big-data-hadoop-distributions.pdf
Hadoop SQL Engines
Source: IBM Big SQL – Vendor Landscape © 2014 IBM Corporation
Big SQL and MPP-Architecture
IBM Big SQL is a high-performance SQL-on-Apache-Hadoop engine
The IBM MPP engine (C++) replaces the MapReduce layer (Java)
Big SQL is an MPP (Massively Parallel Processing) SQL engine
Hive extends Hadoop with data warehouse features
HBase is a distributed, column-oriented database
HDFS is a highly available file system for storing very large volumes of data distributed across many nodes
Source: Big SQL: A Technical Introduction © 2016 IBM Corporation
SMP vs. MPP Architecture
SMP: Dynamically distributes running processes across all available processors which share system resources (multi processor systems)
SMP vs. MPP Architecture
MPP: Distributes a task across multiple independent nodes, each with its own processors, RAM and I/O (shared-nothing architecture)
SMP Scaling
Vertical Scaling
Horizontal Scaling
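To make the shared-nothing idea concrete, the following is a minimal sketch of how an MPP system might assign rows to nodes by hashing a distribution key. It is not DB2 DPF or Big SQL code: the function names, the use of CRC32 as the hash, the modulo placement and the sample keys are all assumptions chosen for illustration.

```python
import zlib
from collections import defaultdict


def node_for_key(key: str, node_count: int) -> int:
    """Map a distribution key to a node number with a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % node_count


def distribute(keys, node_count):
    """Group keys by the node that would own them in a shared-nothing layout."""
    placement = defaultdict(list)
    for key in keys:
        placement[node_for_key(key, node_count)].append(key)
    return dict(placement)


# Hypothetical distribution keys.
keys = [f"customer-{i}" for i in range(1000)]

two_nodes = distribute(keys, 2)
four_nodes = distribute(keys, 4)

# Horizontal scaling: adding nodes changes which node owns some of the keys,
# so part of the data has to move when the cluster grows.
moved = sum(1 for k in keys if node_for_key(k, 2) != node_for_key(k, 4))
print({node: len(owned) for node, owned in sorted(four_nodes.items())})
print(f"{moved} of {len(keys)} keys change owner when going from 2 to 4 nodes")
```

Each node only ever works on the keys it owns, which is what allows the nodes to run in parallel without sharing memory or disks.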
Hadoop
Cluster
DB2 DPF versus Hadoop (HDFS) Hadoop Cluster (Diploma Thesis)
DB2 DPF
DB2 DPF
Source: toadworld.com
Big SQL – IBM Slide
Source: Big SQL: A Technical Introduction © 2016 IBM Corporation
BIG SQL – ITGAIN Slide
Installation Stumbling Blocks
ITGAIN Test Environment
Installing two nodes
• Hardware
2 virtual Servers with 8 Cores / 10 GB RAM / SSDs
• Software
Red Hat Enterprise Linux 7.2 / CentOS 7.2
Ambari 2.2.2.0
Hortonworks Data Platform (HDP) 2.4.2
BETA: Big SQL 4.2 for Hortonworks Data Platform
Extending with two additional identical nodes (DataNode / WorkerNode)
Installation Stumbling Blocks Red Hat or CentOS?
IBM BigInsights for Apache Hadoop 4.2 only supports
Red Hat Enterprise Linux (RHEL) Server 6.7
Red Hat Enterprise Linux (RHEL) Server 7.2
Hortonworks Data Platform HDP 2.4.2 supports
Red Hat Enterprise Linux (RHEL) 6.x - 7.x
CentOS 6.x - 7.x
Debian 7.x
Oracle Linux 6.x - 7.x
SUSE Linux Enterprise Server (SLES) v11 SP3 / SP4
Ubuntu Precise v12.04
Ubuntu Trusty v14.04
Installation Stumbling Blocks Red Hat or CentOS?
Recommendation for the BETA on Hortonworks: Red Hat Enterprise Linux (RHEL) Server 7.2
Test-Cluster on
Red Hat Enterprise Linux (RHEL) Server 7.2
CentOS 7.2
Installation on both OSes was successful
Installation Stumbling Blocks The HDP Installation with Ambari
Tips and Tricks:
• Very simple installation with Ambari, provided there are no errors
• Therefore: prior to the installation take the time to clear any warnings in the Confirm Hosts and Check Scripts
• In case of errors: check the error output in stderr
   Often stderr is empty; the typical cause is a timeout
   If stderr contains errors, attempt to correct the error and retry
• If the installation crashes, it is often easier to retry with a fresh OS rather than modifying the existing OS and retrying the installation
Installation Stumbling Blocks The BigSQL Installation
Recommendations: Execute the Big SQL Pre-Checker before the Installation
Pre-Checker Scripts are available in the installation package but need to be extracted
rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd \
  ./var/lib/ambari-server/resources/stacks/HDP/2.4/services/BIGSQL/package/scripts/bigsql-precheck.sh
rpm2cpio BigInsights-HDP-1.2.0.0-2.4.el7.x86_64.rpm | cpio -ivd \
  ./var/lib/ambari-server/resources/stacks/HDP/2.4/services/BIGSQL/package/scripts/bigsql-util.sh
All errors should be cleared before starting the installation
Installation Stumbling Blocks The BigSQL Installation
Execute for ALL servers!
Only when successful should you start the installation
Installation Stumbling Blocks The BigSQL Installation
Add the Service to a Cluster
Installation Stumbling Blocks The BigSQL Installation
Additional Big SQL Workers can always be added to an individual host via the Add Services option under Hosts
However, this is not possible on a Big SQL Head Node!
Installation Stumbling Blocks Extending the Cluster with Ambari
Additional hosts can easily be added with the Add New Hosts wizard
Data must be redistributed after the extension
Working with BigSQL – The New and the Familiar
DB2 Interface
Where are the tables stored in HDFS?
/apps/hive/warehouse/bigsql.db/firsttable
Or via the Command line (HDFS Browse):
Not everything works with the DB2 Command line: For example loading data into a Hadoop Table
What now?
There is also a command line for BigSQL: JSqsh (Java SQL Shell), pronounced "jay-skwish"
According to the docs it should be found in:
/usr/ibmpacks/common-utils/current/jsqsh
BUT:
SOLUTION: JSqsh isn’t part of the BigSQL-Installation
JSqsh appears in the list of installed clients
JSqsh can also be installed via the open-source GitHub project
JSqsh Setup:
JSqsh Setup: driver selection
JSqsh Setup: Customize the Connection details and save
Requesting the table list with JSqsh
JSqsh command help is available via \help
Defining the current schema, e.g.: use BIGSQL
Requesting a table list in a given schema: \show tables
Starting point: load data into the tables
Tip: for better performance, first copy the load file into HDFS
hdfs dfs -copyFromLocal /tmp/firsttable.csv /tmp/
hdfs dfs -chmod 777 /tmp/firsttable.csv
What happened in the HDFS file system? A new file has appeared
db2top also works: For example, LOAD
Even db2pd works: for example, for LOAD
However, LIST UTILITIES does not work
Loading the Benchmark BIGSQL HDFS Table
The HDFS (DB2-) Blocks
BIGSQL HDFS versus DB2 DPF
DB2 DPF Restrictions
Performance differences DB2 DPF versus DB2 HDFS
Loading 10 million rows
DB2 HDFS: 64 sec.
DB2 DPF: 22 sec.
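As a quick worked calculation, the two load times above can be turned into approximate throughput figures. This is plain arithmetic on the numbers from the slide; the variable names are illustrative, and the results describe this one two-node test only.

```python
ROWS = 10_000_000

# Elapsed load times reported on the slide, in seconds.
elapsed = {"DB2 HDFS": 64, "DB2 DPF": 22}

# Rows loaded per second for each engine.
throughput = {name: ROWS / seconds for name, seconds in elapsed.items()}
for name, rows_per_sec in throughput.items():
    print(f"{name}: {rows_per_sec:,.0f} rows/s")

# How much longer the HDFS load took compared with the DPF load.
ratio = elapsed["DB2 HDFS"] / elapsed["DB2 DPF"]
print(f"The DPF load was about {ratio:.1f}x faster in this test")
```

That works out to roughly 156,000 rows per second for the HDFS table and roughly 455,000 rows per second for DPF, a factor of about 2.9.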
Performance differences DB2 DPF versus DB2 HDFS
Random I/O Benchmark (reading 1023 rows)
[Figure: cold and warm run times for DB2 DPF and DB2 HDFS]
Performance differences DB2 DPF versus DB2 HDFS
Read-Ahead I/O Benchmark (reading 10 million rows)
[Figure: cold and warm run times for DB2 DPF and DB2 HDFS]
The Big Data Deployment (SQL for unstructured Data)
Working with data types for complex (partially structured) data
• ARRAY: Collection of data of the same data type
• MAP: Collection of key-value pairs
• STRUCT: Collection of data with different data types
Working with unstructured data is possible via the Serializer and Deserializer (SerDe)
• The SerDe interface is told how it should process data blocks
• There are many built-in SerDes, for example for JSON, Avro, Parquet and regular expressions
• Many SerDes are available in the public domain
• Specific SerDes that may be required can be developed in Java
Big Data – Working with the ARRAY-Data types
Collection of data of the same datatype
Big Data – Working with MAP Types
Collection of Key-Value pairs
Big Data – Working with STRUCTs
Collection of data with different data types
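For readers who think in terms of a programming language, the three complex types map loosely onto familiar containers. The sketch below uses Python analogues only to show the shape of the data; it is not Big SQL syntax, and the class names, field names and sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Address:
    """STRUCT analogue: a fixed set of named fields with different types."""
    street: str
    zip_code: int


@dataclass
class Customer:
    name: str
    phones: List[str]            # ARRAY analogue: every element has the same type
    attributes: Dict[str, str]   # MAP analogue: key-value pairs
    address: Address             # STRUCT analogue nested in a row


row = Customer(
    name="Example Corp",
    phones=["555-0100", "555-0101"],
    attributes={"segment": "retail", "region": "south"},
    address=Address(street="1 Main St", zip_code=73301),
)

print(row.phones[0])             # index into the array
print(row.attributes["region"])  # look up a map value by its key
print(row.address.zip_code)      # access a struct field by name
```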
Big Data – Unstructured Data
Using SerDes in BigSQL
A SerDe .jar file must be registered in BigSQL before it can be used; it becomes available to BigSQL only after successful registration
3 steps to register:
1. Hive servers: copy the SerDe .jar file into the /lib/ directory
2. Big SQL nodes: copy the SerDe .jar file into the /userlib/ directory of each individual node
3. Restart all BigSQL services
Big Data – Example of Unstructured Data
Example: Parsing log files with Regular Expression (RegexSerDe)
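The idea behind a regular-expression SerDe is that each capture group of the pattern becomes one column of the table. The following is a minimal sketch of that idea in Python, not the RegexSerDe itself. The pattern assumes the Apache Common Log Format, since the expression used on the slide is not shown, and the sample line and field names are made up for illustration.

```python
import re

# Common Log Format pattern (an assumption; the slide's actual expression is not shown).
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line: str):
    """Return the named fields of one log line, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical log line.
sample = '192.0.2.10 - alice [07/Feb/2017:10:15:32 -0600] "GET /index.html HTTP/1.1" 200 5120'
fields = parse_line(sample)
print(fields["host"], fields["status"], fields["request"])
```

Lines that do not match the pattern yield no fields at all, which is why the expression has to be tested against real log samples before it is used in a table definition.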
select * from apache_log fetch first 5 rows only
For example, to correlate Client Data with Web Browser data for analysis of user behavior
Big SQL versus Hive
Hive Big SQL Object Synchronization
Create a table in Hive:
Synchronize the Hive Tables:
Test the Big SQL Table:
Hive Big SQL Data Synchronization (Refresh)
Edit the HDFS File:
Select the Hive Table:
Synchronization (Refresh):
BIGSQL – Sham or Masterstroke?
Sham
DB2 DPF for HDFS
Masterstroke
The right strategy at the right time
Reuse of existing investments
Increased acceptance via the reuse of SQL
Simple integration of Big Data in an existing infrastructure
The Big Data Solution
Big SQL Hadoop tables are not a replacement for OLTP DBMS technology
Big SQL makes it possible to run SQL queries against existing Hadoop data (no proprietary storage formats)
All the data are Hadoop files in HDFS
Big SQL was developed to make effective and efficient use of the Hadoop infrastructure
Most organizations possess experienced SQL developers
No UPDATE or DELETE is possible on a Hadoop table
Much lower license costs than DPF
Good SQL compatibility
Great monitoring with Speedgain for BIGSQL is available
Primary Use cases would be:
To move rarely referenced data out of the Data-Warehouse and onto cheaper hardware while maintaining the ability to query the data via SQL
To set up a new data warehouse
To filter and analyze unstructured data (such as log files, sensor data and social media) as well as to connect this data to existing structured data (such as via federation)
Conclusion
Bluff = Homerun
Q & A