IBM Insight 2014 - Advanced Warehouse Analytics in the Cloud
TRANSCRIPT
dashDB: Advanced Warehouse Analytics in the Cloud
Torsten Steinbach, Armin Stegerer
© 2014 IBM Corporation
Please Note
• IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion.
• Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
• The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
• The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Disclaimer
© Copyright IBM Corporation 2014. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR INFORMATIONAL PURPOSES ONLY. WHILE EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS” WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED. IN ADDITION, THIS INFORMATION IS BASED ON IBM'S CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM WITHOUT NOTICE. IBM SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR ANY OTHER DOCUMENTATION. NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, NOR SHALL HAVE THE EFFECT OF, CREATING ANY WARRANTIES OR REPRESENTATIONS FROM IBM (OR ITS SUPPLIERS OR LICENSORS), OR ALTERING THE TERMS AND CONDITIONS OF ANY AGREEMENT OR LICENSE GOVERNING THE USE OF IBM PRODUCTS AND/OR SOFTWARE.
IBM's statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM's sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
IBM, the IBM logo, ibm.com, Information Management, DB2, DB2 Connect, DB2 OLAP Server, pureScale, System Z, Cognos, solidDB, Informix, Optim, InfoSphere, and z/OS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Agenda
• The Analytics Challenge
• Bringing Analytics to the Data
• Analytics with dashDB
• Predictive Analytics with R
• GeoSpatial Analytics with ESRI
• Analytic Extension Framework
Data is the Basis of New Competitive Advantage
• Increasing investment: over 90% of analytics customers plan to increase their analytics budgets within the next 2 years¹
• Faster ROI, 13 to 1: analytics pays back US$13.01 for every dollar spent, 1.2 times more than it did 3 years ago²
• Data warehouses will get you there: 71% of big data implementations will augment, not replace, existing data warehouses³
¹ "2014 Analytics Market Survey," research note, Nucleus Research, September 2014.
² "Analytics Pays Back $13.01 for Every Dollar Spent," research note, Nucleus Research, September 2014.
³ "Predicts 2014: Why You Should Modernize Your Information Infrastructure," Gartner, November 28, 2013.
The Analytics Challenge
Use cases: Fraud Detection, Buying Behavior, Cross-Selling, Health Risk Assessment, Portfolio Management, Digital Marketing, Store Placement, Route Optimization, Product Pricing, Nearest Shop
Industries: Telco, Health, Banking, Insurance, Retail, Transportation, Government, Manufacturing
Big Data
Reading the data into the analytic tools
Advanced Analytics Is Much More than OLAP or Calculating Statistics
Source: Wiki: CRISP-DM Reference Model
Most Time is Spent in Data Discovery and Preparation
Source: Rexer Analytics Data Miner Survey 2008
Some more recent sources claim this to be up to 60-70%
Data + Data > 2 x Data
Public Data: Weather, News, Stocks, Social Media, ...
Enterprise Data: Orders, CRM, Master Data, Operations, ...
Systems of Engagement: IoT, Mobile Apps, Cloud Apps
Correlation of Structured Data
Optimal ROI of in-db Analytics: through overall reduction of systems (not data movements), improved utilization, and the power of mature structured data processing
Combining various data in a DW can be a fusion reactor for analytics:
• Speed to market
• Improved accuracy
• Lower cost
In-db Analytics provides support for all phases of the analytical process:
• Mathematical: Basic Math*, Permutation and Combination*, Greatest Common Divisor and Least Common Multiple*, Conversion of Values*, Exponential and Logarithm*, Gamma and Beta Functions, Matrix Algebra+, Area Under Curve*, Interpolation Methods*, Transformations
• Time Series: Autoregressive+, Forecasting*
• Data Prep: Data Profiling / Descriptive Statistics+, General Diagnostics, Sampling
• Statistics: Descriptive Statistics+, Distance Measures*, Hypothesis Testing*, Chi-Square & Contingency Tables*, Univariate & Multivariate Distributions+, Monte Carlo Simulation*, Model Testing
• Data Mining: Linear Regression+, Logistic Regression+, Classification, Bayesian, Sampling, Association Rules+, Clustering+, Feature Extraction+, Discriminant Analysis*
• Geospatial: Geospatial Data Type, Geometric Functions, Geometric Analysis, Predictive Geospatial*
Legend: * Fuzzy Logix DB Lytix capabilities; + Netezza Analytics and Fuzzy Logix DB Lytix capabilities
Bringing Analytics to the Data
The One Big Reason for In-Database Analytics: Bring Analytics to the Data
• Scalable and high-performance analytics -> Analytics Accelerator
  – Shorten response times
  – Scale analyzed data volume (both by two- to three-digit factors)
• The secret sauce
  – Data proximity: avoid moving data to analytic tools
  – Scale-out: run code on the MPP architecture of the warehouse engine
  – Talk the language of the user and the application developer: R, SQL, Java, Python, C++, LUA, etc.
  – Flexible runtime model: scalar, aggregate or table functions, external executables
  – Coverage: a wide variety of algorithms and operators out-of-the-box: Predictive, Statistical and GeoSpatial Analytics
• Complements analytic tools, because it allows them to accelerate and scale their analytics: SPSS, R, SAS, ESRI, FuzzyLogix, Zementis, Aginity, ...
IBM Netezza Analytics Ecosystem
PureData for Analytics AMPP Platform:
• Netezza In-Database Analytics: Transformations, Mathematical, Geospatial, Predictive, Statistics, Time Series, Data Mining
• 3rd Party In-Database Analytics: Fuzzy Logix, SAS, Zementis, IBM SPSS
• Software Development Kit: User-Defined Extensions (UDF, UDA, UDTF, UDAP), Language Support (Map/Reduce, Java, R, Python, Lua, Perl, C, C++, Fortran)
• Ecosystem: Mathworks, Open Source R, BI Tools, Visualization Tools, Eclipse, SAS, IBM SPSS, Apache Hadoop, Cloudera, IBM InfoSphere BigInsights, IBM InfoSphere Streams, Esri
Netezza Analytics is one of the leaders for in-database analytics, making Netezza an attractive platform for users and third-party vendors in the predictive analytics space.
Analytics of Warehouse Data
This is where we start from: all analytic processing is done on the application side.
• Analytic Applications hold both the analytic code & algorithms and the analytic data
• Data is pulled out via SQL and processed in the analytic application
Accelerate Analytics for Warehouse Data
Push Down Step 1: BLU tables are only logically represented in the analytic application
• Simple data lookup & massage operations are pushed down as SQL operations
• Benefit: acceleration with no SQL skills required
Accelerate Analytics for Warehouse Data
Push Down Step 2: typical and popular algorithms are pushed down to canned UDFs in the database
• Call built-in functions via SQL (issued by analytic applications and cloud tooling) to execute typical algorithms inside the database
• Benefit: bring standard analytics to the data
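As a hedged illustration of this push-down step (a sketch, not taken from the deck), the standard SQL:2003 regression aggregates can already fit a simple linear model entirely inside the database engine. The SHOWCASE_SYSUSAGE table and its MEMUSED/USERS columns are borrowed from the R examples later in this transcript:

```sql
-- Sketch only: fit MEMUSED = slope * USERS + intercept inside the
-- database, instead of pulling rows out into an external analytic tool.
SELECT REGR_SLOPE(MEMUSED, USERS)     AS slope,
       REGR_INTERCEPT(MEMUSED, USERS) AS intercept,
       REGR_COUNT(MEMUSED, USERS)     AS n_rows
FROM   DB2INST1.SHOWCASE_SYSUSAGE;
```

Only the two aggregate results travel back to the application; the full table never leaves the warehouse.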
Accelerate Analytics for Warehouse Data
Push Down Step 3: execute entire customer analytic programs inside the database via the language framework (UDX & AE)
• Deploy customer code and call it via special SQL function interfaces (SQL plus canned algorithms)
• Benefit: bring custom analytics to the data
Analytics with dashDB
Modernize existing Data Warehousing with on-demand cloud agility
Embrace the concept of the logical data warehouse by combining cloud and on-premises deployments
Faster insight without the up front infrastructure investment
Full support for hybrid “ground to cloud” deployments
Cloud is Essential to the Modern Data Warehouse
Organizations gaining competitive advantage through cloud adoption report 2x revenue growth and 2.5x higher gross profit, as compared to peer companies who are more cautious about cloud computing.¹
• 77% of enterprises are in the initial stages of cloud adoption²
• 84% of CIOs cut application costs by moving to the cloud²
• 58% of IT decision makers think cloud solutions give them better control of their data³
¹ http://www-03.ibm.com/press/us/en/pressrelease/42304.wss
² http://www.huffingtonpost.com/vala-afshar/the-top-100-cloud-computi_b_3756172.html
³ http://www.businesswire.com/news/home/20100722005325/en/Cloud-Computing-Delivering-Promise-Doubts-Hold-Adoption#.UufrRKX0B8Y
IBM's Analytics Cloud Service Ecosystem
• Watson Analytics: cloud-based predictive & cognitive analytics discovery platform; designed for business use; integrated social collaboration; freemium to enterprise versions
• DataWorks: self-service access & integration of multiple data sources; simplified tools to prepare, refine & secure data; open application programming interfaces for application development; on-premise and cloud, internal & external data
• dashDB: rapid deployment of large-scale data warehouses; scaling of both volume and processing speed; unified architecture enabling hybrid data processing, on-premise & in the cloud; in-database analytic capabilities for the best analytic performance
dashDB – Available With Three Deployment Choices
• Enterprise Plan: dedicated infrastructure; terabyte-scale capacity; closed beta for qualified accounts
• Bluemix: deploy within the Bluemix cloud-based environment for analytics and warehousing services; ingest data from a wide variety of sources; in-database analytics included; pay as you go; rapid deployment
• Cloudant: auto-provisioning from the Cloudant management GUI; built-in automated synchronization from Cloudant JSON data stores; built-in analytics for Cloudant data; pay as you go; rapid deployment
dashDB Entry Plan
Bare metal: 4x 8-core, 256GB RAM; data center 1Gbps connection
• 2x 500GB HDD (/root, /opt, /etc): the 1TB local HDD has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 12x 200GB SSD: 1.2TB local SSD mounted at /mnt/bludata0, used for the database
• Legacy iSCSI drives (detachable): used to store the DB2 database and configuration; /mnt/blumeta0 is used for configuration
• Swift Object Storage: backups and metadata; run commands to back up and restore (backs up from the iSCSI LUNs to Swift, restores from Swift to the iSCSI LUNs)
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
dashDB Enterprise 1TB Plan
VM #1: public shared 16-core @ 2.0GHz, 64GB RAM; data center 1Gbps connection
• 100GB SAN1 (/root, /opt, /etc): has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 1TB SAN2 (detachable): holds the database and configuration for DB2 (/mnt/bludata0 for the database; /mnt/blumeta0 -> /mnt/bludata0/blumeta0 for configuration)
• Swift Object Storage: backups and metadata; backs up from SAN2 to Swift, restores from Swift to SAN2
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
dashDB Enterprise 4TB Plan – Compute Optimized
Bare metal: 32-core, 256GB RAM; data center 10Gbps connection
• 2x 500GB HDD (/root, /opt, /etc): the 1TB local HDD has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 4TB Consistent Performance Storage (6K IOPS): holds the database and configuration for DB2 (/mnt/bludata0 for the database, /mnt/blumeta0 for configuration)
• Swift Object Storage: backups and metadata; backs up from Consistent Performance Storage to Swift, restores from Swift to Consistent Performance Storage
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
dashDB Enterprise 12TB Plan – Storage Optimized
Bare metal: 32-core, 256GB RAM; data center 10Gbps connection
• 2x 500GB HDD (/root, /opt, /etc): the 1TB local HDD has the OS installed with the necessary binaries and scripts (/mnt/blutmp0 with 16GB swap space; /opt, /etc, /usr, /bin, ...)
• 12TB Consistent Performance Storage (6K IOPS): holds the database and configuration for DB2 (/mnt/bludata0 for the database, /mnt/blumeta0 for configuration)
• Swift Object Storage: backups and metadata; backs up from Consistent Performance Storage to Swift, restores from Swift to Consistent Performance Storage
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
Server Outage Availability Scenario for dashDB Enterprise 4TB & 12TB
• Bare Metal #1 (primary): 32-core, 256GB RAM, 2x 500GB HDD (/root, /opt, /etc); mounts the 4/12TB Consistent Performance Storage (/mnt/bludata0, /mnt/blumeta0)
• Bare Metal #2 (standby): 32-core, 256GB RAM, 2x 500GB HDD (/root, /opt, /etc)
• Swift Object Storage: backups and metadata; backs up from Consistent Performance Storage to Swift, restores from Swift to the Consistent Performance iSCSI volume
• Data center: 10Gbps connection
• When the primary server (BM #1) fails, its Consistent Performance iSCSI volume is re-mapped from the primary server (BM #1) to the standby server (BM #2)
Shared services:
• Guardium (shared): public shared 8-core, 16GB RAM, 100GB SAN
• DSM (shared): public shared 16-core, 64GB RAM, 1000GB SAN
Coming Up Soon: Initial dashDB MPP Offering
• Probably 8 partitions per node of the initial cluster; 2TB storage per cluster node; one node comparable to the 4TB SMP offering (bare metal, 16 cores, 128 or 256GB memory, local storage)
• Smallest cluster offered: 3 nodes, i.e. 6TB; grow in one-node steps, up to 10 nodes (i.e. 20TB), by distributing entire MLNs of the initial cluster instead of redistributing data
• Larger MPP offerings are going to be rolled out in a second phase
• All of this might still change before release
We Bring the Same Compatible Analytic Platform from Netezza to the Cloud
• Analytic Extension Framework: UDX C++ API; AE Framework; In-DB R, In-DB LUA, In-DB Python, In-DB Perl
• Canned Analytics:
  – OLAP functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, STDDEV, COVAR, ...
  – Predictive: Linear Regression, Kmeans Clustering, Decision Tree, Association Rules, Naive Bayes
  – Spatial operators: Contains, Touches, Within, Intersects, Crosses, Overlaps
• Application Integration: R Wrapper, Watson Analytics, ESRI ArcGIS Connector, ...
• Analytics Applications of ISVs and Customers
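To make the canned OLAP window functions listed above concrete, here is a minimal sketch; the sales table and its columns are invented for illustration, not taken from the deck:

```sql
-- Rank rows within each group and compare each row to its neighbor,
-- all computed inside the database engine.
SELECT region,
       revenue,
       ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS row_num,
       RANK()       OVER (PARTITION BY region ORDER BY revenue DESC) AS rev_rank,
       LAG(revenue) OVER (PARTITION BY region ORDER BY revenue DESC) AS prev_revenue
FROM   sales;
```

Because these run as window functions in the engine, no self-join or client-side post-processing is needed to express rankings or row-to-row deltas.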
Predictive Analytics with R
Predictive Analytics With R
• Very popular language for statisticians and data miners
> val1 <- c(23,54,100,134,200,252,311)> val2 <- sqrt(val1)> lm_vals <- lm(val1~val2)> summary(lm_vals)
Residuals: 1 2 3 4 5 6 7 23.480 -3.052 -16.814 -18.330 -10.170 2.785 22.102
Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -108.570 20.645 -5.259 0.0033 ** val2 22.538 1.667 13.523 3.96e-05 ***
:> plot(lm_vals)
• Built-in support for graphs and charting; large set of mathematical and statistical packages thanks to extensibility and a very active community
• Data Frames: tables of data maintained in the memory of the R runtime
> col1 <- c(23,54,100)
> col2 <- c("xyz", "abc", "123")
> col3 <- c(TRUE, FALSE, TRUE)
> myDf <- data.frame(col1, col2, col3)
• Data frames can be populated from DB tables via the RODBC package
> library(RODBC)
> myconn <- odbcConnect("mydsn", uid="db2inst1", pwd="secret")
> myDf <- sqlQuery(myconn, "select * from employees")
dashDB
Predictive Analytics With R in dashDB 1/3
• Built-in R runtime & RStudio
• ibmdbR package: data frames logically representing data physically residing in Dynamite tables
> con <- idaConnect("BLUDB", "", "")
> idaAnalyticsInit(con)
> sysusage <- ida.data.frame('DB2INST1.SHOWCASE_SYSUSAGE')
> systems <- ida.data.frame('DB2INST1.SHOWCASE_SYSTEMS')
> systypes <- ida.data.frame('DB2INST1.SHOWCASE_SYSTYPES')
Push down of R data preparation to Dynamite:
> sysusage2 <- sysusage[sysusage$MEMUSED>50000, c("MEMUSED","USERS")]
> mergedSys <- idaMerge(systems, systypes, by='TYPEID')
> mergedUsage <- idaMerge(sysusage2, mergedSys, by='SID')
Push down of analytic algorithms to in-database execution:
> lm1 <- idaLm(MEMUSED~USERS, mergedUsage)
[Diagram: RStudio in the browser, or any R runtime, connects to dashDB through the ibmdbR package]
Predictive Analytics With R in dashDB 2/3
• Dynamite-native implementation of statistical functions:
• colnames, cor, cov, dim, head, length, max, mean, min, names, print, sd, summary, var
Logically derived columns pushed down to Dynamite:
> myDF <- ida.data.frame('DB2INST1.SHOWCASE_SYSUSAGE')
> myDF$MemPerUser <- myDF$MEMUSED / myDF$USERS
Sampling of tables in Dynamite:
> idaSample(myDF, 3)
  SID DATE                       USERS MEMUSED ALERT MemPerUser
1   8 2014-02-14 23:39:00.000000    34    5015 f            147
2   5 2014-01-22 07:52:00.000000    96   11512 f            119
3   7 2013-09-12 05:17:00.000000    39    5592 t            143
Statistics about tables in Dynamite:
> summary(myDF)
      SID            USERS             MEMUSED            ALERT          MemPerUser
 Min.   :0.000   Min.   :  3.000   Min.   :  350.000   f   :3655563   Min.   :105.000
 1st Qu.:2.000   1st Qu.: 35.000   1st Qu.: 5113.000   t   :1344437   1st Qu.:135.000
 Median :4.500   Median : 64.000   Median : 9455.000   NA's:     NA   Median :150.000
 Mean   :   NA   Mean   :     NA   Mean   :       NA                  Mean   :     NA
 3rd Qu.:7.000   3rd Qu.:111.000   3rd Qu.:16517.000                  3rd Qu.:165.000
 Max.   :9.000   Max.   :347.000   Max.   :62379.000                  Max.   :209.000
Statistics about categorical values:
> idaTable(myDF)
ALERT
      f       t
3655563 1344437
Predictive Analytics With R in dashDB 3/3
• Store R objects in the Dynamite database:
> myPrivateObjects <- ida.list(type='private')
> myPrivateObjects['series100'] <- 1:100
> x <- myPrivateObjects['series100']
> x
  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22
 [23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
 [45] 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
 [67] 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
 [89] 89 90 91 92 93 94 95 96 97 98 99 100
> names(myPrivateObjects)
[1] "series100"
> myPrivateObjects['series100'] <- NULL
Manage Dynamite tables:
> idaExistTable('DB2INST1.SHOWCASE_SYSUSAGE')
[1] TRUE
> idaShowTables()
    Schema                   Name    Owner Type
1 BLUADMIN      R_OBJECTS_PRIVATE BLUADMIN    T
2 BLUADMIN R_OBJECTS_PRIVATE_META BLUADMIN    T
3 BLUADMIN       R_OBJECTS_PUBLIC BLUADMIN    T
4 BLUADMIN  R_OBJECTS_PUBLIC_META BLUADMIN    T
> myView <- idaCreateView(myDF)
> idaIsView(myView)
[1] TRUE
> idaDropView(myView)
> idaIsView(myView)
[1] FALSE
GeoSpatial Analytics with ESRI
The Power of Place
• Spatial awareness is a dramatically increasing property of big data due to mobile computing and the Internet of Things
• Spatial insight is directly available in dashDB through a built-in spatial data type and operators, for instance:
  – WITHIN, e.g.: Show me the clients that are affected by a power outage!
  – OVERLAPS, e.g.: Which of my cell phone customers are at risk of cell tower service outage due to upcoming tornados?
  – TOUCHES, e.g.: Give me the neighboring ZIP areas per customer for customized marketing campaigns!
  – DISTANCE, e.g.: List the top 5 closest stores!
  – DISJOINT, e.g.: Which claims are candidates for insurance fraud because a client submitted a claim from a different place than the one the case is about?
  – ... and ~100 further operators
• Supported and leveraged by ESRI, a major spatial tooling vendor
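A minimal sketch of what such queries can look like in SQL/MM style. The CLIENTS, OUTAGES, and STORES tables with their geometry columns are hypothetical, and the db2gse schema qualifier is assumed here as the typical home of DB2-family spatial functions:

```sql
-- WITHIN example (sketch): clients whose location lies inside an
-- outage polygon, expressed as a spatial join.
SELECT c.client_id, c.name
FROM   clients c, outages o
WHERE  db2gse.ST_Within(c.location, o.area) = 1;

-- DISTANCE example (sketch): the 5 stores closest to a given point.
SELECT s.store_id,
       db2gse.ST_Distance(s.location,
                          db2gse.ST_Point(-87.62, 41.88, 1)) AS dist
FROM   stores s
ORDER BY dist
FETCH FIRST 5 ROWS ONLY;
```

The point here is that the predicates run as ordinary SQL functions inside the warehouse, so spatial joins scale with the engine rather than with the client tool.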
GeoSpatial Analytics in dashDB
• Implements the ISO SQL/MM standard for spatial (see http://www.iso.org/iso/catalogue_detail.htm?csnumber=38651)
• Spatial data type ST_GEOMETRY (hierarchy)
• Enables spatial joins in the database through spatial operators available as user-defined functions
• Dedicated support in ESRI tools starting with V10.3 (http://www.esri.com/software/arcgis/arcgis-for-desktop/free-trial)
• Example GeoSpatial applications: Telco location data, utilities smart grid, GPS tracking in transportation, insurance, demographics, cable marketing campaigns, retail store placement
Examples of using ESRI ArcGIS with dashDB 1/3
Load spatial data into dashDB; discover & browse spatial data with ArcCatalog (counties, tornado paths over the last 50 years)
Examples of using ESRI ArcGIS with dashDB 2/3
Combine spatial data from dashDB into interactive maps with ArcMap
Examples of using ESRI ArcGIS with dashDB 3/3
Perform spatial joins in dashDB using query layers and visualize the results in ArcMap (tornado risk per county)
Insurance Risk Analysis – Show Case Overview
• Public spatial data sets available online:
  – Historical tornados from the 1950s to today: http://www.spc.noaa.gov/gis/svrgis/
  – Current tornado weather warnings: http://www.nws.noaa.gov/regsci/gis/shapefiles/
  – US counties: https://www.census.gov/geo/maps-data/data/tiger-line.html
• Mobile application generating spatial data for insurance claims for tornado damage
• Cloud warehouse service (dashDB) for analytics and correlation between customer data (insurance master data) and public or third-party data
• Visualization and spatial analysis capabilities by Esri ArcGIS
• Cloud service for persistency of the system of engagement: www.cloudant.com
• www.bluemix.net
Twitter-dashDB Show Case (www.youtube.com/watch?v=9yVNwOs9L4c)
http://american-sniper-analysis.mybluemix.net
Analytic Extension Framework
The Two Elements of the Analytic Extension Framework
1. User Defined Extension (UDX) – C++ API
Three types of UDXs:
• Scalar Functions
  SELECT MyXForm(Col1, Col2) FROM MyTab
• Aggregate Functions
  SELECT Col1, MyAgg(Col2) FROM MyTab GROUP BY Col1
• Table Functions
  SELECT b.MyCol1 FROM MyTab a, TABLE(MyTableFunc(a.Col1, a.Col2)) AS b
C++ code is compiled and linked within the dashDB service and registered via DDL, e.g.:
CREATE FUNCTION MyXForm(VARCHAR(ANY), INTEGER) RETURNS VARCHAR(ANY) LANGUAGE CPP PARAMETER STYLE NPSGENERIC EXTERNAL NAME 'mylib.so!cMyFunc'
CREATE FUNCTION MyAgg(INTEGER) LANGUAGE CPP RETURNS DOUBLE AGGREGATE WITH (SUM INTEGER) PARAMETER STYLE NPSGENERIC EXTERNAL NAME 'mylib.so!cMyAgg'
CREATE FUNCTION MyTableFunc(VARARGS) RETURNS TABLE (Col1 INTEGER) LANGUAGE CPP PARAMETER STYLE NPSGENERIC EXTERNAL NAME 'mylib.so!cMyUDTF'
2. REST API & tooling for development & deployment: pushFile, pullFile, executeCC, compile, link, promote, createPackage, deployPackage, getProjList, getFileList, executeDDL, executeSQL, dropUDX, ...
User Defined Scalar Function API Example

class cMyFunc : public nz::udx_ver2::Udf
{
public:
    cMyFunc(UdxInit *pInit) : Udf(pInit) { }
    static nz::udx_ver2::Udf* instantiate(UdxInit *pInit);

    virtual nz::udx_ver2::ReturnValue evaluate()
    {
        int int1 = int32Arg(0);
        int int2 = int32Arg(1);
        int retVal = int1 * int2;
        NZ_UDX_RETURN_INT32(retVal);
    }
};

nz::udx_ver2::Udf* cMyFunc::instantiate(UdxInit *pInit)
{
    return new cMyFunc(pInit);
}
User Defined Aggregate Function API Example

class cMyAgg : public nz::udx_ver2::Uda
{
public:
    cMyAgg(UdxInit *pInit) : Uda(pInit) { }
    static nz::udx_ver2::Uda* instantiate(UdxInit *pInit);

    void initializeState()
    {
        int64 *s = int64State(0);
        *s = 0;
        setStateNull(0, false);
    }

    // Accumulate data in states.
    virtual void accumulate()
    {
        if (isArgNull(0))
            return;
        int64 *s = int64State(0);
        *s += int16Arg(0);
    }

    // States flowed in as input; merge back into state.
    virtual void merge()
    {
        accumulate();
    }

    // Merged data copied to input.
    virtual ReturnValue finalResult()
    {
        if (isArgNull(0))
            NZ_UDX_RETURN_NULL();
        setReturnNull(false);
        NZ_UDX_RETURN_INT64(int64Arg(0));
    }
};
User Defined Table Function API Example

class OneUdtf : public nz::udx_ver2::Udtf
{
private:
    int32 argInt, xcount;
public:
    OneUdtf(UdxInit *pInit) : Udtf(pInit) { }
    static nz::udx_ver2::Udtf* instantiate(UdxInit *pInit);

    virtual void newInputRow()
    {
        argInt = 0;
        for (int i = 0; i < numArgs(); i++) {
            if (argType(i) == UDX_INT32) {
                argInt = int32Arg(i);
            } else {
                throwUdxException("Unknown type");
            }
        }
        xcount = 1;
    }

    virtual DataAvailable nextOutputRow()
    {
        if (xcount > 5)
            return Done;
        for (int i = 0; i < numReturnColumns(); i++) {
            setReturnColumnNull(i, false);
            if (returnTypeColumn(i) == UDX_INT32) {
                *int32ReturnColumn(i) = argInt + xcount;
            } else {
                throwUdxException("Unknown type");
            }
        }
        xcount++;
        return MoreData;
    }
};
Analytic Extension Development Process
[Diagram: setup for a dashDB developer working from the command line, 3rd-party IDEs, or a Cloud Web IDE (elements marked "under consideration" and "under construction"). Through the REST API, pushFile/pullFile move .cpp sources, .o objects, and logs in and out; compile turns .cpp into .o; link produces an .so; promote releases binaries; createPackage/deployPackage handle .zip packages; executeDDL registers the UDX in the BLUDB catalog; SQL then runs against dashDB over DRDA.]
Some Examples Highlighting the REST API
Login and keep a cookie for the session:
curl -d j_username=<User> -d j_password=<PW> https://<IP>:8443/services/loginService -c ck.dat
Upload source files:
curl -F cmd=pushFile -F proj=udsf1 -F subDir=src --form "file[0]=@./udsf1.cpp" --form "file[1]=@./opr.cpp" --form "file[2]=@./opr.h" https://<IP>:8443/ida -b ck.dat
Compile source files:
curl -d cmd=compile -d proj=udsf1 -d targetDir=bin -d "files={\"files\":[\"src/udsf1.cpp\"]}" https://<IP>:8443/ida -b ck.dat
Link object files:
curl -d cmd=link -d proj=udsf1 -d targetDir=bin -d "files={\"files\":[\"bin/udsf1.o\"]}" https://<IP>:8443/ida -b ck.dat
Alternatively, a low-level cc invocation:
curl -d cmd=executeCC -d proj=udsf1 -d "args=-m64 -Wall -fPIC -c -D_CPLUSPLUS src/udsf1.cpp -I/mnt/blumeta0/home/db2inst1/sqllib/include -o udsf1.o" https://<IP>:8443/ida -b ck.dat
Promote linked binaries to the release directory:
curl -d cmd=promote -d proj=udsf1 -d "files=lib*.so" https://<IP>:8443/ida -b ck.dat
Register the UDX with DDL:
curl -d cmd=executeDDL -d profileName=BLUDB -d "ddl=CREATE FUNCTION udf1(INT) RETURNS INT LANGUAGE CPP PARAMETER STYLE NPSGENERIC FENCED EXTERNAL NAME '/mnt/blumeta0/home/bluadmin/projects/udsf1/release/libudsf1.so!CUdf';" https://<IP>:8443/blushiftservices/BluShiftHttp.do -b ck.dat
A Proof Point of UDX Support in dashDB
We have a working prototype of the entire Netezza SQL Extension Toolkit for dashDB!