IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
TRANSCRIPT
dashDB: Advanced Warehouse Analytics in the Cloud
Torsten Steinbach, Armin Stegerer
© 2014 IBM Corporation
Please Note
• IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Disclaimer
© Copyright IBM Corporation 2014. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
IBM, the IBM logo, ibm.com, Information Management, DB2, DB2 Connect, DB2 OLAP Server, pureScale, System Z, Cognos, solidDB, Informix, Optim, InfoSphere, and z/OS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Agenda
• The Analytics Challenge
• Bringing Analytics to the Data
• Analytics with dashDB
• Predictive Analytics with R
• GeoSpatial Analytics with ESRI
• Analytic Extension Framework
Data is the Basis of New Competitive Advantage
• Increasing investment: over 90% of analytics customers plan to increase their analytics budgets within the next 2 years¹
• Faster ROI, 13 to 1: analytics pays back US$13.01 for every dollar spent, 1.2 times more than it did 3 years ago²
• Data warehouses will get you there: 71% of big data implementations will augment, not replace, existing data warehouses³
¹ "2014 Analytics Market Survey," research note, Nucleus Research, September 2014.
² "Analytics Pays Back $13.01 for Every Dollar Spent," research note, Nucleus Research, September 2014.
³ "Predicts 2014: Why You Should Modernize Your Information Infrastructure," Gartner, November 28, 2013.
The Analytics Challenge
Use cases: Fraud Detection, Buying Behavior, Cross-Selling, Health Risk Assessment, Portfolio Management, Digital Marketing, Store Placement, Route Optimization, Product Pricing, Nearest Shop
Industries: Telco, Health, Banking, Insurance, Retail, Transportation, Government, Manufacturing
Big Data
Reading the data into the analytic tools
Advanced Analytics Is Much More than OLAP or Calculating Statistics
Source: Wiki: CRISP-DM Reference Model
Most Time is Spent in Data Discovery and Preparation
Source: Rexer Analytics Data Miner Survey 2008
Some more recent sources claim this to be up to 60-70%
Data + Data > 2 x Data
Public Data: Weather, News, Stocks, Social Media, ...
Enterprise Data: Orders, CRM, Master Data, Operations, ...
Systems of Engagement: IoT, Mobile Apps, Cloud Apps
Correlation of Structured Data
Optimal ROI of in-db Analytics: through overall reduction of systems (not data movements), improved utilization, and the power of mature structured data processing
Combining various data in a DW can be a fusion reactor for analytics:
• Speed to market
• Improved accuracy
• Lower cost
In-db Analytics provides support for all phases of the analytical process:
• Mathematical: Basic Math*, Permutation and Combination*, Greatest Common Divisor and Least Common Multiple*, Conversion of Values*, Exponential and Logarithm*, Gamma and Beta Functions, Matrix Algebra+, Area Under Curve*, Interpolation Methods*, Transformations
• Time Series: Autoregressive+, Forecasting*
• Data Prep: Data Profiling / Descriptive Statistics+, General Diagnostics, Sampling
• Statistics: Descriptive Statistics+, Distance Measures*, Hypothesis Testing*, Chi-Square & Contingency Tables*, Univariate & Multivariate Distributions+, Monte Carlo Simulation*, Model Testing
• Data Mining: Linear Regression+, Logistic Regression+, Classification, Bayesian, Sampling, Association Rules+, Clustering+, Feature Extraction+, Discriminant Analysis*
• Geospatial: Geospatial Data Type, Geometric Functions, Geometric Analysis, Predictive Geospatial*
Legend: * Fuzzy Logix DB Lytix capabilities; + Netezza Analytics and Fuzzy Logix DB Lytix capabilities
Bringing Analytics to the Data
The One Big Reason for In-Database Analytics: Bring Analytics to the Data
• Scalable and high-performance analytics -> Analytics Accelerator
  – Shorten response times
  – Scale analyzed data volume (both by two- to three-digit factors)
• The secret sauce
  – Data proximity: avoid moving data to analytic tools
  – Scale-out: run code on the MPP architecture of the warehouse engine
  – Talk the language of the user and the application developer: R, SQL, Java, Python, C++, LUA, etc.
  – Flexible runtime model: scalar, aggregate or table functions, external executables
  – Coverage: a wide variety of algorithms and operators out-of-the-box: Predictive, Statistical and GeoSpatial Analytics
• Complements analytic tools, because it allows them to accelerate and scale their analytics: SPSS, R, SAS, ESRI, FuzzyLogix, Zementis, Aginity, ...
IBM Netezza Analytics Ecosystem
PureData for Analytics AMPP Platform:
• Netezza In-Database Analytics: Transformations, Mathematical, Geospatial, Predictive, Statistics, Time Series, Data Mining
• 3rd Party In-Database Analytics: Fuzzy Logix, SAS, Zementis, IBM SPSS
• Software Development Kit: User-Defined Extensions (UDF, UDA, UDTF, UDAP), Language Support (Map/Reduce, Java, R, Python, Lua, Perl, C, C++, Fortran)
• Ecosystem: Mathworks, Open Source R, BI Tools, Visualization Tools, Eclipse, SAS, IBM SPSS, Apache Hadoop, Cloudera, IBM InfoSphere BigInsights, IBM InfoSphere Streams, Esri
Netezza Analytics is one of the leaders for in-database analytics, making Netezza an attractive platform for users and third-party vendors in the predictive analytics space.
Analytics of Warehouse Data
This is where we start from: all analytic processing is done on the application side.
• Analytic Applications hold both the analytic code & algorithms and the analytic data
• Data is pulled out via SQL and processed in the analytic application
Accelerate Analytics for Warehouse Data
Push Down Step 1: BLU tables are only logically represented in the analytic application
• Simple data lookup & massage operations are pushed down as SQL operations
• Benefit: acceleration with no SQL skills required
Accelerate Analytics for Warehouse Data
Push Down Step 2: typical and popular algorithms are pushed down to canned UDFs in the database
• Call built-in functions via SQL (issued by analytic applications and cloud tooling) to execute typical algorithms inside the database
• Benefit: bring standard analytics to the data
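As a hedged illustration of this push-down step (a sketch, not taken from the deck), the standard SQL:2003 regression aggregates can already fit a simple linear model entirely inside the database engine. The SHOWCASE_SYSUSAGE table and its MEMUSED/USERS columns are borrowed from the R examples later in this transcript:

```sql
-- Sketch only: fit MEMUSED = slope * USERS + intercept inside the
-- database, instead of pulling rows out into an external analytic tool.
SELECT REGR_SLOPE(MEMUSED, USERS)     AS slope,
       REGR_INTERCEPT(MEMUSED, USERS) AS intercept,
       REGR_COUNT(MEMUSED, USERS)     AS n_rows
FROM   DB2INST1.SHOWCASE_SYSUSAGE;
```

Only the two aggregate results travel back to the application; the full table never leaves the warehouse.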
Accelerate Analytics for Warehouse Data
Push Down Step 3: execute entire customer analytic programs inside the database via the language framework (UDX & AE)
• Deploy customer code and call it via special SQL function interfaces (SQL plus canned algorithms)
• Benefit: bring custom analytics to the data
Analytics with dashDB
Modernize existing Data Warehousing with on-demand cloud agility
Embrace the concept of the logical data warehouse by combining cloud and on-premises deployments
Faster insight without the up front infrastructure investment
Full support for hybrid “ground to cloud” deployments
Cloud is Essential to the Modern Data Warehouse
Organizations gaining competitive advantage through cloud adoption report 2x revenue growth and 2.5x higher gross profit, as compared to peer companies who are more cautious about cloud computing.¹
• 77% of enterprises are in the initial stages of cloud adoption²
• 84% of CIOs cut application costs by moving to the cloud²
• 58% of IT decision makers think cloud solutions give them better control of their data³
¹ http://www-03.ibm.com/press/us/en/pressrelease/42304.wss
² http://www.huffingtonpost.com/vala-afshar/the-top-100-cloud-computi_b_3756172.html
³ http://www.businesswire.com/news/home/20100722005325/en/Cloud-Computing-Delivering-Promise-Doubts-Hold-Adoption#.UufrRKX0B8Y
IBM's Analytics Cloud Service Ecosystem
• Watson Analytics: cloud-based predictive & cognitive analytics discovery platform; designed for business use; integrated social collaboration; freemium to enterprise versions
• DataWorks: self-service access & integration of multiple data sources; simplified tools to prepare, refine & secure data; open application programming interfaces for application development; on-premise and cloud, internal & external data
• dashDB: rapid deployment of large-scale data warehouses; scaling of both volume and processing speed; unified architecture enabling hybrid data processing, on-premise & in the cloud; in-database analytic capabilities for the best analytic performance
dashDB – Available With Three Deployment Choices
• Enterprise Plan: dedicated infrastructure; terabyte-scale capacity; closed beta for qualified accounts
• Bluemix: deploy within the Bluemix cloud-based environment for analytics and warehousing services; ingest data from a wide variety of sources; in-database analytics included; pay as you go; rapid deployment
• Cloudant: auto-provisioning from the Cloudant management GUI; built-in automated synchronization from Cloudant JSON data stores; built-in analytics for Cloudant data; pay as you go; rapid deployment
dashDB Entry Plan
Bare metal: 4x 8-core, 256GB RAM; data center 1Gbps connection
• 2x 500GB HDD (/root, /opt, /etc): the 1TB local HDD has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 12x 200GB SSD: 1.2TB local SSD mounted at /mnt/bludata0, used for the database
• Legacy iSCSI drives (detachable): used to store the DB2 database and configuration; /mnt/blumeta0 is used for configuration
• Swift Object Storage: backups and metadata; run commands to back up and restore (backs up from the iSCSI LUNs to Swift, restores from Swift to the iSCSI LUNs)
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
dashDB Enterprise 1TB Plan
VM #1: public shared 16-core @ 2.0GHz, 64GB RAM; data center 1Gbps connection
• 100GB SAN1 (/root, /opt, /etc): has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 1TB SAN2 (detachable): holds the database and configuration for DB2 (/mnt/bludata0 for the database; /mnt/blumeta0 -> /mnt/bludata0/blumeta0 for configuration)
• Swift Object Storage: backups and metadata; backs up from SAN2 to Swift, restores from Swift to SAN2
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
dashDB Enterprise 4TB Plan – Compute Optimized
Bare metal: 32-core, 256GB RAM; data center 10Gbps connection
• 2x 500GB HDD (/root, /opt, /etc): the 1TB local HDD has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 4TB Consistent Performance Storage (6K IOPS): holds the database and configuration for DB2 (/mnt/bludata0 for the database, /mnt/blumeta0 for configuration)
• Swift Object Storage: backups and metadata; backs up from Consistent Performance Storage to Swift, restores from Swift to Consistent Performance Storage
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
dashDB Enterprise 12TB Plan – Storage Optimized
Bare metal: 32-core, 256GB RAM; data center 10Gbps connection
• 2x 500GB HDD (/root, /opt, /etc): the 1TB local HDD has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 12TB Consistent Performance Storage (6K IOPS): holds the database and configuration for DB2 (/mnt/bludata0 for the database, /mnt/blumeta0 for configuration)
• Swift Object Storage: backups and metadata; backs up from Consistent Performance Storage to Swift, restores from Swift to Consistent Performance Storage
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
Server Outage Availability Scenario for dashDB Enterprise 4TB & 12TB
• Bare Metal #1 (primary): 32-core, 256GB RAM, 2x 500GB HDD (/root, /opt, /etc); mounts the 4/12TB Consistent Performance Storage (/mnt/bludata0, /mnt/blumeta0)
• Bare Metal #2 (standby): 32-core, 256GB RAM, 2x 500GB HDD (/root, /opt, /etc)
• Swift Object Storage: backups and metadata; backs up from Consistent Performance Storage to Swift, restores from Swift to the Consistent Performance iSCSI volume
• Data center: 10Gbps connection
• When the primary server (BM #1) fails, its Consistent Performance iSCSI volume is re-mapped from the primary server (BM #1) to the standby server (BM #2)
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
Coming Up Soon: Initial dashDB MPP Offering
• Probably 8 partitions per node of the initial cluster; 2TB storage per cluster node; one node comparable to the 4TB SMP offering (bare metal, 16 cores, 128 or 256GB memory, local storage)
• Smallest cluster offered: 3 nodes, i.e. 6TB; grow in one-node steps, up to 10 nodes (i.e. 20TB), by distributing entire MLNs of the initial cluster instead of redistributing data
• Larger MPP offerings are going to be rolled out in a second phase
• All of this might still change before release
We Bring the Same Compatible Analytic Platform from Netezza to the Cloud
• Analytic Extension Framework: UDX C++ API; AE Framework; In-DB R, In-DB LUA, In-DB Python, In-DB Perl
• Canned Analytics:
  – OLAP functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, STDDEV, COVAR, ...
  – Predictive: Linear Regression, Kmeans Clustering, Decision Tree, Association Rules, Naive Bayes
  – Spatial operators: Contains, Touches, Within, Intersects, Crosses, Overlaps
• Application Integration: R Wrapper, Watson Analytics, ESRI ArcGIS Connector, ...
• Analytics Applications of ISVs and Customers
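To make the canned OLAP window functions listed above concrete, here is a minimal sketch; the sales table and its columns are invented for illustration, not taken from the deck:

```sql
-- Rank rows within each group and compare each row to its neighbor,
-- all computed inside the database engine.
SELECT region,
       revenue,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS row_num,
       RANK()       OVER (PARTITION BY region ORDER BY revenue DESC) AS rev_rank,
       LAG(revenue) OVER (PARTITION BY region ORDER BY revenue DESC) AS prev_revenue
FROM   sales;
```

Because these run as window functions in the engine, no self-join or client-side post-processing is needed to express rankings or row-to-row deltas.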
Predictive Analytics with R
Predictive Analytics With R
• Very popular language for statisticians and data miners
> val1 <- c(23,54,100,134,200,252,311)> val2 <- sqrt(val1)> lm_vals <- lm(val1~val2)> summary(lm_vals)
Residuals: 1 2 3 4 5 6 7 23.480 -3.052 -16.814 -18.330 -10.170 2.785 22.102
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -108.570 20.645 -5.259 0.0033 ** val2 22.538 1.667 13.523 3.96e-05 ***
:> plot(lm_vals)
• Built-in support for graphs and charting; large set of mathematical and statistical packages thanks to extensibility and a very active community
• Data Frames: tables of data maintained in the memory of the R runtime
> col1 <- c(23,54,100)
> col2 <- c("xyz", "abc", "123")
> col3 <- c(TRUE, FALSE, TRUE)
> myDf <- data.frame(col1, col2, col3)
• Data frames can be populated from DB tables via the RODBC package
> library(RODBC)
> myconn <- odbcConnect("mydsn", uid="db2inst1", pwd="secret")
> myDf <- sqlQuery(myconn, "select * from employees")
dashDB
Predictive Analytics With R in dashDB 1/3
• Built-in R runtime & RStudio
• ibmdbR package: data frames logically representing data physically residing in Dynamite tables
> con <- idaConnect("BLUDB", "", "")
> idaAnalyticsInit(con)
> sysusage <- ida.data.frame('DB2INST1.SHOWCASE_SYSUSAGE')
> systems <- ida.data.frame('DB2INST1.SHOWCASE_SYSTEMS')
> systypes <- ida.data.frame('DB2INST1.SHOWCASE_SYSTYPES')
Push down of R data preparation to Dynamite:
> sysusage2 <- sysusage[sysusage$MEMUSED>50000, c("MEMUSED","USERS")]
> mergedSys <- idaMerge(systems, systypes, by='TYPEID')
> mergedUsage <- idaMerge(sysusage2, mergedSys, by='SID')
Push down of analytic algorithms to in-database execution:
> lm1 <- idaLm(MEMUSED~USERS, mergedUsage)
[Diagram: RStudio in the browser, or any R runtime, connects to dashDB through the ibmdbR package]
Predictive Analytics With R in dashDB 2/3
• Dynamite-native implementation of statistical functions:
• colnames, cor, cov, dim, head, length, max, mean, min, names, print, sd, summary, var
Logically derived columns pushed down to Dynamite:
> myDF <- ida.data.frame('DB2INST1.SHOWCASE_SYSUSAGE')
> myDF$MemPerUser <- myDF$MEMUSED / myDF$USERS
Sampling of tables in Dynamite:
> idaSample(myDF, 3)
  SID DATE                       USERS MEMUSED ALERT MemPerUser
1   8 2014-02-14 23:39:00.000000    34    5015 f            147
2   5 2014-01-22 07:52:00.000000    96   11512 f            119
3   7 2013-09-12 05:17:00.000000    39    5592 t            143
Statistics about tables in Dynamite:
> summary(myDF)
      SID            USERS             MEMUSED            ALERT          MemPerUser
 Min.   :0.000   Min.   :  3.000   Min.   :  350.000   f   :3655563   Min.   :105.000
 1st Qu.:2.000   1st Qu.: 35.000   1st Qu.: 5113.000   t   :1344437   1st Qu.:135.000
 Median :4.500   Median : 64.000   Median : 9455.000   NA's:     NA   Median :150.000
 Mean   :   NA   Mean   :     NA   Mean   :       NA                  Mean   :     NA
 3rd Qu.:7.000   3rd Qu.:111.000   3rd Qu.:16517.000                  3rd Qu.:165.000
 Max.   :9.000   Max.   :347.000   Max.   :62379.000                  Max.   :209.000
Statistics about categorical values:
> idaTable(myDF)
ALERT
      f       t
3655563 1344437
Predictive Analytics With R in dashDB 3/3
• Store R objects in the Dynamite database:
> myPrivateObjects <- ida.list(type='private')
> myPrivateObjects['series100'] <- 1:100
> x <- myPrivateObjects['series100']
> x
  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 [23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
 [45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
 [67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
 [89] 89 90 91 92 93 94 95 96 97 98 99 100
> names(myPrivateObjects)
[1] "series100"
> myPrivateObjects['series100'] <- NULL
Manage Dynamite tables:
> idaExistTable('DB2INST1.SHOWCASE_SYSUSAGE')
[1] TRUE
> idaShowTables()
    Schema                   Name    Owner Type
1 BLUADMIN      R_OBJECTS_PRIVATE BLUADMIN    T
2 BLUADMIN R_OBJECTS_PRIVATE_META BLUADMIN    T
3 BLUADMIN       R_OBJECTS_PUBLIC BLUADMIN    T
4 BLUADMIN  R_OBJECTS_PUBLIC_META BLUADMIN    T
> myView <- idaCreateView(myDF)
> idaIsView(myView)
[1] TRUE
> idaDropView(myView)
> idaIsView(myView)
[1] FALSE
GeoSpatial Analytics with ESRI
The Power of Place
• Spatial awareness is a dramatically increasing property of big data due to mobile computing and the Internet of Things
• Spatial insight is directly available in dashDB through a built-in spatial data type and operators, for instance:
  – WITHIN, e.g.: Show me the clients that are affected by a power outage!
  – OVERLAPS, e.g.: Which of my cell phone customers are at risk of cell tower service outage due to upcoming tornados?
  – TOUCHES, e.g.: Give me the neighboring ZIP areas per customer for customized marketing campaigns!
  – DISTANCE, e.g.: List the top 5 closest stores!
  – DISJOINT, e.g.: Which claims are candidates for insurance fraud because a client submitted a claim from a different place than the one the case is about?
  – ... and ~100 further operators
• Supported and leveraged by ESRI, a major spatial tooling vendor
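A minimal sketch of what such queries can look like in SQL/MM style. The CLIENTS, OUTAGES, and STORES tables with their geometry columns are hypothetical, and the db2gse schema qualifier is assumed here as the typical home of DB2-family spatial functions:

```sql
-- WITHIN example (sketch): clients whose location lies inside an
-- outage polygon, expressed as a spatial join.
SELECT c.client_id, c.name
FROM   clients c, outages o
WHERE  db2gse.ST_Within(c.location, o.area) = 1;

-- DISTANCE example (sketch): the 5 stores closest to a given point.
SELECT s.store_id,
       db2gse.ST_Distance(s.location,
                          db2gse.ST_Point(-87.62, 41.88, 1)) AS dist
FROM   stores s
ORDER BY dist
FETCH FIRST 5 ROWS ONLY;
```

The point here is that the predicates run as ordinary SQL functions inside the warehouse, so spatial joins scale with the engine rather than with the client tool.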
GeoSpatial Analytics in dashDB
• Implements the ISO SQL/MM standard for spatial (see http://www.iso.org/iso/catalogue_detail.htm?csnumber=38651)
• Spatial data type ST_GEOMETRY (hierarchy)
• Enables spatial joins in the database through spatial operators available as user-defined functions
• Dedicated support in ESRI tools starting with V10.3 (http://www.esri.com/software/arcgis/arcgis-for-desktop/free-trial)
• Example GeoSpatial applications: Telco location data, utilities smart grid, GPS tracking in transportation, insurance, demographics, cable marketing campaigns, retail store placement
Examples of using ESRI ArcGIS with dashDB 1/3
Load spatial data into dashDB; discover & browse spatial data with ArcCatalog (counties, tornado paths over the last 50 years)
Examples of using ESRI ArcGIS with dashDB 2/3
Combine spatial data from dashDB into interactive maps with ArcMap
Examples of using ESRI ArcGIS with dashDB 3/3
Perform spatial joins in dashDB using query layers and visualize the results in ArcMap (tornado risk per county)
Insurance Risk Analysis – Show Case Overview
• Public spatial data sets available online:
  – Historical tornados from the 1950s to today: http://www.spc.noaa.gov/gis/svrgis/
  – Current tornado weather warnings: http://www.nws.noaa.gov/regsci/gis/shapefiles/
  – US counties: https://www.census.gov/geo/maps-data/data/tiger-line.html
• Mobile application generating spatial data for insurance claims for tornado damage
• Cloud warehouse service (dashDB) for analytics and correlation between customer data (insurance master data) and public or third-party data
• Visualization and spatial analysis capabilities by Esri ArcGIS
• Cloud service for persistency of the system of engagement: www.cloudant.com
• www.bluemix.net
Twitter-dashDB Show Case (www.youtube.com/watch?v=9yVNwOs9L4c)
http://american-sniper-analysis.mybluemix.net
Analytic Extension Framework
The Two Elements of the Analytic Extension Framework
1. User Defined Extension (UDX) – C++ API
Three types of UDXs:
• Scalar Functions
  SELECT MyXForm(Col1, Col2) FROM MyTab
• Aggregate Functions
  SELECT Col1, MyAgg(Col2) FROM MyTab GROUP BY Col1
• Table Functions
  SELECT b.MyCol1 FROM MyTab a, TABLE(MyTableFunc(a.Col1, a.Col2)) AS b
C++ code is compiled and linked within the dashDB service and registered via DDL, e.g.:
CREATE FUNCTION MyXForm(VARCHAR(ANY), INTEGER) RETURNS VARCHAR(ANY) LANGUAGE CPP PARAMETER STYLE NPSGENERIC EXTERNAL NAME 'mylib.so!cMyFunc'
CREATE FUNCTION MyAgg(INTEGER) LANGUAGE CPP RETURNS DOUBLE AGGREGATE WITH (SUM INTEGER) PARAMETER STYLE NPSGENERIC EXTERNAL NAME 'mylib.so!cMyAgg'
CREATE FUNCTION MyTableFunc(VARARGS) RETURNS TABLE (Col1 INTEGER) LANGUAGE CPP PARAMETER STYLE NPSGENERIC EXTERNAL NAME 'mylib.so!cMyUDTF'
2. REST API & tooling for development & deployment: pushFile, pullFile, executeCC, compile, link, promote, createPackage, deployPackage, getProjList, getFileList, executeDDL, executeSQL, dropUDX, ...
User Defined Scalar Function API Example

class cMyFunc : public nz::udx_ver2::Udf
{
public:
    cMyFunc(UdxInit *pInit) : Udf(pInit) { }
    static nz::udx_ver2::Udf* instantiate(UdxInit *pInit);

    virtual nz::udx_ver2::ReturnValue evaluate()
    {
        int int1 = int32Arg(0);
        int int2 = int32Arg(1);
        int retVal = int1 * int2;
        NZ_UDX_RETURN_INT32(retVal);
    }
};

nz::udx_ver2::Udf* cMyFunc::instantiate(UdxInit *pInit)
{
    return new cMyFunc(pInit);
}
User Defined Aggregate Function API Example

class cMyAgg : public nz::udx_ver2::Uda
{
public:
    cMyAgg(UdxInit *pInit) : Uda(pInit) { }
    static nz::udx_ver2::Uda* instantiate(UdxInit *pInit);

    void initializeState()
    {
        int64 *s = int64State(0);
        *s = 0;
        setStateNull(0, false);
    }

    // Accumulate data in states.
    virtual void accumulate()
    {
        if (isArgNull(0))
            return;
        int64 *s = int64State(0);
        *s += int16Arg(0);
    }

    // States flowed in as input; merge back into state.
    virtual void merge()
    {
        accumulate();
    }

    // Merged data copied to input.
    virtual ReturnValue finalResult()
    {
        if (isArgNull(0))
            NZ_UDX_RETURN_NULL();
        setReturnNull(false);
        NZ_UDX_RETURN_INT64(int64Arg(0));
    }
};
User Defined Table Function API Example

class OneUdtf : public nz::udx_ver2::Udtf
{
private:
    int32 argInt, xcount;
public:
    OneUdtf(UdxInit *pInit) : Udtf(pInit) { }
    static nz::udx_ver2::Udtf* instantiate(UdxInit *pInit);

    virtual void newInputRow()
    {
        argInt = 0;
        for (int i = 0; i < numArgs(); i++) {
            if (argType(i) == UDX_INT32) {
                argInt = int32Arg(i);
            } else {
                throwUdxException("Unknown type");
            }
        }
        xcount = 1;
    }

    virtual DataAvailable nextOutputRow()
    {
        if (xcount > 5)
            return Done;
        for (int i = 0; i < numReturnColumns(); i++) {
            setReturnColumnNull(i, false);
            if (returnTypeColumn(i) == UDX_INT32) {
                *int32ReturnColumn(i) = argInt + xcount;
            } else {
                throwUdxException("Unknown type");
            }
        }
        xcount++;
        return MoreData;
    }
};
Analytic Extension Development Process
[Diagram: setup for a dashDB developer working from the command line, 3rd-party IDEs, or a Cloud Web IDE (elements marked "under consideration" and "under construction"). Through the REST API, pushFile/pullFile move .cpp sources, .o objects, and logs in and out; compile turns .cpp into .o; link produces an .so; promote releases binaries; createPackage/deployPackage handle .zip packages; executeDDL registers the UDX in the BLUDB catalog; SQL then runs against dashDB over DRDA.]
Some Examples Highlighting the REST API
Login and keep a cookie for the session:
curl -d j_username=<User> -d j_password=<PW> https://<IP>:8443/services/loginService -c ck.dat
Upload source files:
curl -F cmd=pushFile -F proj=udsf1 -F subDir=src --form "file[0]=@./udsf1.cpp" --form "file[1]=@./opr.cpp" --form "file[2]=@./opr.h" https://<IP>:8443/ida -b ck.dat
Compile source files:
curl -d cmd=compile -d proj=udsf1 -d targetDir=bin -d "files={\"files\":[\"src/udsf1.cpp\"]}" https://<IP>:8443/ida -b ck.dat
Link object files:
curl -d cmd=link -d proj=udsf1 -d targetDir=bin -d "files={\"files\":[\"bin/udsf1.o\"]}" https://<IP>:8443/ida -b ck.dat
Alternatively, a low-level cc invocation:
curl -d cmd=executeCC -d proj=udsf1 -d "args=-m64 -Wall -fPIC -c -D_CPLUSPLUS src/udsf1.cpp -I/mnt/blumeta0/home/db2inst1/sqllib/include -o udsf1.o" https://<IP>:8443/ida -b ck.dat
Promote linked binaries to the release directory:
curl -d cmd=promote -d proj=udsf1 -d "files=lib*.so" https://<IP>:8443/ida -b ck.dat
Register the UDX with DDL:
curl -d cmd=executeDDL -d profileName=BLUDB -d "ddl=CREATE FUNCTION udf1(INT) RETURNS INT LANGUAGE CPP PARAMETER STYLE NPSGENERIC FENCED EXTERNAL NAME '/mnt/blumeta0/home/bluadmin/projects/udsf1/release/libudsf1.so!CUdf';" https://<IP>:8443/blushiftservices/BluShiftHttp.do -b ck.dat
A Proof Point of UDX Support in dashDB
We have a working prototype of the entire Netezza SQL Extension Toolkit for dashDB!