big data: getting off to a fast start with big sql (world of watson 2016 session alb-3663)

Getting off to a fast start with Big Data analytics using Big SQL (ALB-3663)

C. M. SaraccoOct. 25, 2016

Please note IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general productdirection and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

10/19/16World of Watson 2016 2

• Big SQL = industry-standard SQL for Hadoop platforms • Easy on-ramp to Hadoop for SQL professionals • Supports familiar SQL tools / apps (via JDBC and ODBC drivers)

• How to get started? • Create tables / views. Store data in HDFS, HBase, or Hive warehouse • Load data into tables (from remote RDBMS, files) • Query data (project, restrict, join, union, wide range of sub-queries, wide range of built-

in functions, UDFs, . . . . ) • Explore advanced features

– Collect statistics and inspect data access plan – Transparently join/union Hadoop & RDBMS data (query federation) – Leverage Big Data technologies: SerDes, Spark, . . .

Overview

• Command line: Java SQL Shell (JSqsh)

• Web tooling (Data Server Manager)

• Tools supporting IBM JDBC/ODBC driver

Invocation options

Creating a Big SQL table

Standard CREATE TABLE DDL with extensionscreate hadoop table users(

id int not null primary key,office_id int null,fname varchar(30) not null,lname varchar(30) not null)

row format delimitedfields terminated by '|'stored as textfile;

Worth noting: • “Hadoop” keyword creates table in DFS • Row format delimited and textfile formats are default • Constraints not enforced (but useful for query optimization)

Results from previous CREATE TABLE . . .

• Data stored in subdirectory of Hive warehouse . . . /hive/warehouse/myid.db/users

• Default schema is user ID. Can create new schemas • “Table” is just a subdirectory under schema.db• Table’s data are files within table subdirectory

• Meta data collected (Big SQL & Hive) • SYSCAT.* and SYSHADOOP.* views

• Optionally, use LOCATION clause of CREATE TABLE to layer Big SQL schema over existing DFS directory contents • Useful if table contents already in DFS • Avoids need to LOAD data

Populating Tables via LOAD

• Typically best runtime performance

• Load data from remote file system

load hadoop using file url'sftp://myID:[email protected]:22/install-dir/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt’ with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE gosalesdw.GO_REGION_DIM overwrite;

• Loads data from RDBMS (DB2, Netezza, Teradata, Oracle, MS-SQL, Informix) via JDBC

load hadoop using jdbc connection url'jdbc:db2://some.host.com:portNum/sampledb' with parameters (user='myID', password='myPassword') from table MEDIA columns (ID, NAME) where 'CONTACTDATE < ''2012-02-01''' into table media_db2table_jan overwrite with load properties ('num.map.tasks' = 10);

Querying your Big SQL tables

• Same as ISO SQL-compliant RDBMS

• No special query syntax for Hadoop tables • Projections, restrictions

• UNION, INTERSECT, EXCEPT

• Wide range of built-in functions (e.g. OLAP)

• Full support for subqueries

• All standard join operations

• . . .

SELECTs_name,count(*) AS numwait

FROMsupplier,lineitem l1,orders,nation

WHEREs_suppkey = l1.l_suppkeyAND o_orderkey = l1.l_orderkeyAND o_orderstatus = 'F'AND l1.l_receiptdate > l1.l_commitdateAND EXISTS (

SELECT*

FROMlineitem l2

WHEREl2.l_orderkey = l1.l_orderkeyAND l2.l_suppkey <> l1.l_suppkey

)AND NOT EXISTS (

SELECT*

FROMlineitem l3

WHEREl3.l_orderkey = l1.l_orderkeyAND l3.l_suppkey <> l1.l_suppkeyAND l3.l_receiptdate >

l3.l_commitdate)AND s_nationkey = n_nationkeyAND n_name = ':1'

GROUP BY s_nameORDER BY numwait desc, s_name;

A word about . . . SerDes

• Custom serializers / deserializers (SerDes)• Read / write complex or “unusual” data formats (e.g., JSON) • Commonly used by Hadoop community. Developed by user or available publicly

• Add SerDes to directories; reference SerDe when creating table -- Create table for JSON data using open source hive-json-serde-0.2.jar SerDe

-- Location clause points to DFS dir containing JSON data

-- External clause means DFS dir & data won’t be drop after DROP TABLE command

create external hadoop table socialmedia-json (Country varchar(20), . . . )

row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'

location '</hdfs_path>/myJSON';

select * from socialmedia-json;

Sample JSON input for previous example

JSON-based social media data to load into Big SQL Table socialmedia-json defined with SerDe

Sample Big SQL query output for JSON data

Sample output: Select * from socialmedia-json

Accessing Big SQL data from Spark shell // establish a Hive context and query some Big SQL data

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

val saleFacts = sqlContext.sql("select * from bigsql.sls_sales_fact")

// take some action on the data – count # of rows

saleFacts.count()

. . .

// transform the data as needed (create a Vector with data from 2 cols)

val subset = saleFacts.map {row => Vectors.dense(row.getDouble(16),row.getDouble(17))}

// invoke basic Spark MLlib statistical function over the data

val stats = Statistics.colStats(subset)

// print one of the statistics collected

println(stats.mean)

Big SQL query federation = virtualized data access

Transparent§ Appears to be one source§ Programmers don’t need to know how /

where data is stored

Heterogeneous§ Accesses data from diverse sources

High Function§ Full query support against all data§ Capabilities of sources as well

Autonomous§ Non-disruptive to data sources, existing

applications, systems.

High Performance§ Optimization of distributed queries

SQL tools, applications Data sources

Virtualizeddata

Joining data across sources

Creating and using federated objects (example)

-- Create wrapper to identify client library (Oracle Net8)

CREATE WRAPPER ORA LIBRARY 'libdb2net8.so'

-- Create server for Oracle data source

CREATE SERVER ORASERV TYPE ORACLE VERSION 11 WRAPPER ORA AUTHORIZATION

\”orauser\” PASSWORD \”orauser\” OPTIONS (NODE 'TNSNODENAME', PUSHDOWN 'Y', COLLATING_SEQUENCE 'N');

-- Map the local user 'orauser' to the Oracle user 'orauser' / password 'orauser'

CREATE USER MAPPING FOR orauser SERVER ORASERV OPTIONS ( REMOTE_PASSWORD 'orauser');

-- Create nickname for Oracle table / view

CREATE NICKNAME NICK1 FOR ORASERV.ORAUSER.TABLE1;

-- Query the nickname

SELECT * FROM NICK1 WHERE COL1 < 10;

A word about . . . data access plans

• Cost-based optimizer with query rewrite • Example: SELECT DISTINCT COL_PK,

COL_X . . . FROM TABLE • Automatically rewrite query to avoid sort

– PK constraint implies no nulls or duplicates– Transparent to programmer

• ANALYZE TABLE … collects statistics• Automatic or manual collection• Efficient runtime performance

• EXPLAIN reports detailed access plan • Subset shown at right

Get started with Big SQL: External resources

Hadoop Dev: videos, tutorials, forum, . . . https://developer.ibm.com/hadoop/

http://www.slideshare.net/CynthiaSaracco/presentations

Notices and disclaimers

Copyright © 2016 by International Business Machines Corporation (IBM). No part of this document may be reproduced or transmitted in any form without written permission from IBM.

U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM.

Information in these presentations (including information relating to products that have not yet been announced by IBM) has beenreviewed for accuracy as of the date of initial publication and could include unintentional technical or typographical errors. IBM shall have no responsibility to update this information. THIS DOCUMENT IS DISTRIBUTED "AS IS" WITHOUT ANY WARRANTY, EITHER EXPRESS OR IMPLIED. IN NO EVENT SHALL IBM BE LIABLE FOR ANY DAMAGE ARISING FROM THE USE OF THIS INFORMATION, INCLUDING BUT NOT LIMITED TO, LOSS OF DATA, BUSINESS INTERRUPTION, LOSS OF PROFIT OR LOSS OF OPPORTUNITY. IBM products and services are warranted according to the terms and conditions of the agreements under which they are provided.

IBM products are manufactured from new parts or new and used parts. In some cases, a product may not be new and may have been previously installed. Regardless, our warranty terms apply.”

Any statements regarding IBM's future direction, intent or product plans are subject to change or withdrawal without notice.

Performance data contained herein was generally obtained in a controlled, isolated environments. Customer examples are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual performance, cost, savings or other results in other operating environments may vary.

References in this document to IBM products, programs, or services does not imply that IBM intends to make such products, programs or services available in all countries in which IBM operates or does business.

Workshops, sessions and associated materials may have been prepared by independent session speakers, and do not necessarily reflect the views of IBM. All materials and discussions are provided for informational purposes only, and are neither intended to, nor shall constitute legal or other guidance or advice to any individual participant or their specific situation.

It is the customer’s responsibility to insure its own compliance with legal requirements and to obtain advice of competent legal counsel as to the identification and interpretation of any relevant laws and regulatory requirements that may affect the customer’s business and any actions the customer may need to take to comply with such laws. IBM does not provide legal advice or representor warrant that its services or products will ensure that the customer is in compliance with any law.

18 10/19/16World of Watson 2016

Notices and disclaimers continued

Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products in connection with this publication and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. IBM does not warrant the quality of any third-party products, or the ability of any such third-party products to interoperate with IBM’s products. IBM EXPRESSLY DISCLAIMS ALL WARRANTIES, EXPRESSED OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The provision of the information contained herein is not intended to, and does not, grant any right or license under any IBM patents, copyrights, trademarks or other intellectual property right.

IBM, the IBM logo, ibm.com, Aspera®, Bluemix, Blueworks Live, CICS, Clearcase, Cognos®, DOORS®, Emptoris®, Enterprise Document Management System™, FASP®, FileNet®, Global Business Services ®, Global Technology Services ®, IBM ExperienceOne™, IBM SmartCloud®, IBM Social Business®, Information on Demand, ILOG, Maximo®, MQIntegrator®, MQSeries®, Netcool®, OMEGAMON, OpenPower, PureAnalytics™, PureApplication®, pureCluster™, PureCoverage®, PureData®, PureExperience®, PureFlex®, pureQuery®, pureScale®, PureSystems®, QRadar®, Rational®, Rhapsody®, Smarter Commerce®, SoDA, SPSS, Sterling Commerce®, StoredIQ, Tealeaf®, Tivoli®, Trusteer®, Unica®, urban{code}®, Watson, WebSphere®, Worklight®, X-Force® and System z® Z/OS, are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at: www.ibm.com/legal/copytrade.shtml.

19 10/19/16World of Watson 2016

Thank You