What's New in SAP HANA SPS 11 Predictive

Post on 14-Apr-2017

Page 1: What's New in SAP HANA SPS 11 Predictive

© 2014 SAP AG or an SAP affiliate company. All rights reserved.

SAP HANA SPS 11 - What’s New? Advanced Analytics & Predictive Analysis Library

SAP HANA Product Management, December 2015 (Delta from SPS 10 to SPS 11)

Page 2: What's New in SAP HANA SPS 11 Predictive

© 2015 SAP SE or an SAP affiliate company. All rights reserved.

Agenda

Topics

Overview: What’s New in SAP HANA SPS11 Predictive Analysis Library

New algorithms in the Predictive Analysis Library

– Incl. Demo Random Forest

Enhancements to algorithms in the Predictive Analysis Library

Page 3: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library

New Algorithms

Classification algorithms: Random Forest

– Popular and fast ensemble learning method for classification or regression scenarios. A random forest builds a series of individual trees in parallel, delivering robust predictive models.

Embedding predictive algorithms, such as incremental classification and clustering functions, within SAP HANA Smart Data Streaming

– Machine learning algorithms in streaming environments can learn from and make predictions based on incoming data in real time. For both supervised and unsupervised learning scenarios, the algorithms have been specifically optimized to deal with streaming data.

Other new functions and statistics

– Survival analysis statistics procedure (Kaplan-Meier survival analysis).

– Area under curve (AUC) method to evaluate the performance of classification algorithms based on receiver operating characteristic (ROC) curves.

– The cluster assignment method assigns data to clusters that were previously generated by clustering methods such as K-means, DBSCAN and SOM.

– The binning assignment function assigns data to the bins previously generated by the Binning algorithm.

– Posterior scaling scales data based on the scaling model previously generated by the scaling range procedure. The new data is assumed to come from a similar distribution, and it does not update the scaling model.

Page 4: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library

Enhancements

General enhancements

– BIGINT data type support for ID columns with various PAL functions

Enhancements to algorithms

– Logistic Regression support for big data scenarios, leveraging stochastic gradient descent (SGD) as an alternative optimization method designed to address extremely large training input data.

– Forecast Smoothing algorithm enhanced with
o Support for Mean Absolute Percent Error (MAPE), a widely used forecasting optimization indicator
o Support for limited-memory BFGS (L-BFGS-B) as an alternative forecasting parameter estimation method
o Support for additional trend damping algorithm options
o Train and test data separation, returning evaluation results on the test data

– Apriori and FP-growth association algorithms support an “UBIQUITOUS” parameter for filtering out highly frequent items

– Further enhancements to the following algorithms
o Principal Component Analysis (PCA), Self-Organizing Maps clustering, Gaussian Mixture Model (GMM) clustering, Distribution Fitting, etc.

Page 5: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Random Forest

New Algorithms

Classification algorithms: Random Forest

– Popular and fast ensemble learning method for classification or regression scenarios.

– Random Forest
o Runs a series of classification or regression models over random bootstrap samples from the data
o Combines those results by voting (classification) or averaging (regression)
o Results in robust models with high prediction quality

Random Forests are one of the most powerful, fully automated, machine learning techniques. With almost no data preparation or modeling expertise, analysts can effortlessly obtain surprisingly effective models. “Random Forests” is an essential component in the modern data scientist’s toolkit and in this brief overview we touch on the essentials of this groundbreaking methodology.

A random forest of many decision trees

Page 6: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Random Forest

New Algorithm – Random Forest

Random Forests algorithm logic

– Grow a forest of many trees (default is 500).
– Grow each tree on an independent bootstrap sample (drawn with replacement) from the training data.
– At each node:
o Select m variables at random out of all M possible variables (independently for each node).
o Find the best split on the selected m variables.
– Grow the trees to maximum depth.
– Vote/average the trees to get predictions for new data.

Advantages of Random Forests

– Applicable to both regression and classification problems.
– Handle categorical predictors naturally.
– Computationally simple and quick to fit, even for large problems.
– No formal distributional assumptions (non-parametric).
– Can handle highly non-linear interactions and classification boundaries.
– Automatic variable selection.
– Handle missing values.

Result accuracy & robustness

– Accuracy: Random Forests are competitive with the best known machine learning methods.
– Do not overfit when more trees are fit.
– The out-of-bag (OOB) error gives an estimate of the test set error, so cross-validation is not necessary.
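The algorithm logic above can be sketched in a few lines of Python. This is an illustration only, not the PAL implementation: to stay short, each "tree" is a single split (a stump) rather than a tree grown to maximum depth, and all function names, the toy data and the parameter values are assumptions made for this example.

```python
import random
from collections import Counter

def bootstrap_sample(n_rows, rng):
    """Draw n_rows indices with replacement; also return the out-of-bag indices."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature_ids):
    """Best single split over the candidate features (stand-in for a full tree)."""
    best = None
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            if not left or not right:
                continue
            l_lab, r_lab = majority(left), majority(right)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, l_lab, r_lab)
    if best is None:  # sample has no usable split: always predict the majority
        return (feature_ids[0], float("inf"), majority(labels), majority(labels))
    return best[1:]

def predict_stump(stump, row):
    f, threshold, left_label, right_label = stump
    return left_label if row[f] <= threshold else right_label

def fit_forest(rows, labels, trees_num=25, try_num=1, seed=0):
    """Grow trees_num stumps, each on its own bootstrap sample and feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(trees_num):
        in_bag, _ = bootstrap_sample(len(rows), rng)
        feature_ids = rng.sample(range(len(rows[0])), try_num)  # m out of M variables
        forest.append(fit_stump([rows[i] for i in in_bag],
                                [labels[i] for i in in_bag], feature_ids))
    return forest

def predict_forest(forest, row):
    """Classification: majority vote over all trees."""
    return majority([predict_stump(stump, row) for stump in forest])

# Toy data (hypothetical): two numeric features, two classes.
rows = [(1, 1), (2, 1), (3, 2), (8, 9), (9, 8), (10, 9)]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict_forest(forest, (0, 0)), predict_forest(forest, (10, 10)))
```

The out-of-bag indices returned by `bootstrap_sample` are the rows a given tree never saw; scoring each tree on those rows is what yields the OOB error estimate mentioned above.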

Page 7: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Random Forest

New Algorithm – Random Forest

Random Forest use in Application Function Modeler dataflow graphs

– All SPS11 new functions will be made available in the AFM palette in a fix revision for SAP HANA Studio.

*The SPS11 AFM function palette update will be made available with a fix revision for SAP HANA Studio.

Figure label: variable importance

Page 8: What's New in SAP HANA SPS 11 Predictive

Demo
SAP HANA Predictive Analysis Library – Random Forest Demo

Page 9: What's New in SAP HANA SPS 11 Predictive


Predictive Analysis Library – New Algorithm

Random Forest Demo

Predict whether income exceeds $50K/yr based on census data.

Also known as the "Adult" dataset.

– Features: age, work class, education, marital status, occupation, race, sex, native country, etc.

– PAL parameters: TREES_NUM (default=500), TRY_NUM (default=sqrt(#features) for classification, (#features)/3 for regression)

– Output: Model in JSON, Variable importance, error rate (classification) / mean square error (regression), confusion matrix (classification)
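As a small worked example of the TRY_NUM defaults quoted above, the helper below computes both values for a given feature count. The function name is hypothetical, and rounding down to an integer is an assumption; the slide does not say how PAL rounds.

```python
import math

def default_try_num(n_features, task):
    """Default number of variables tried per split, as quoted on the slide.

    Rounding down (and never going below 1) is an assumption of this sketch.
    """
    if task == "classification":
        return max(1, int(math.sqrt(n_features)))
    if task == "regression":
        return max(1, n_features // 3)
    raise ValueError("task must be 'classification' or 'regression'")

# Assuming 14 predictor columns, as in the public "Adult" census dataset.
print(default_try_num(14, "classification"), default_try_num(14, "regression"))
```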

Page 10: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Classification Evaluation

New Algorithms

New evaluation functions: Area Under Curve (AUC)

– The receiver operating characteristic (ROC) curve is the most commonly used way to visualize the performance of a binary classifier. It plots the true positive rate (TPR) against the false positive rate (FPR) at several thresholds.*

– The area under the ROC curve (AUC) is (arguably) the best way to summarize, in a single number, the performance of a classification algorithm visualized in a ROC curve.
o An AUC value between 0.5 and 1 indicates that the classifier separates positive from negative cases better than chance. Depending on the use case, values starting at 0.6 are regarded as indicating valuable discriminating effects.

*For binary classification, the class assigned to each observation depends on the probability threshold chosen. For example, if the predicted probability is larger than the threshold (0.8, or 0.75, or 0.6, or …), the prediction is classified as true.

Figure: ROC example curve for a given classification model, with points marked at Threshold=0.8, Threshold=0.5 and Threshold=0.3, and the AUC for the respective example.

Figure labels: Logistic Regression; Input Data from training output

With threshold set to 0.8, ID 3 would be classified as a false positive case.
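To make the threshold sweep concrete, here is a minimal sketch that builds ROC points from predicted probabilities and integrates them with the trapezoidal rule. The function names and the six-row example are made up for illustration, and the sketch treats a score at or above the threshold as a positive prediction; it is not the PAL AUC function.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points, one per distinct threshold, starting at (0, 0)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical classifier output: true class and predicted probability.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(roc_points(labels, scores)))  # 8 of 9 positive/negative pairs are ranked correctly
```

The result equals the fraction of (positive, negative) pairs in which the positive case receives the higher score, which is why 0.5 corresponds to random ranking and 1 to perfect separation.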

Page 11: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – New Assignment Functions

New Algorithms

New functions supporting assignment of new data

– The cluster assignment method assigns new data to clusters that were previously generated by clustering methods such as K-means, DBSCAN and SOM.

– The binning assignment function assigns data to the bins previously generated by the Binning algorithm.

– Posterior scaling scales new data based on the scaling model previously generated by the scaling range procedure. The new data is assumed to come from a similar distribution, and it does not update the scaling model.
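The idea behind posterior scaling can be sketched as follows: a scaling model is fitted once on training data and then applied unchanged to new data. Min-max scaling is assumed here as the scaling method, and the function names and dictionary layout are hypothetical, not PAL's model format.

```python
def fit_min_max(values, new_min=0.0, new_max=1.0):
    """Fit a min-max scaling model on training data."""
    return {"min": min(values), "max": max(values),
            "new_min": new_min, "new_max": new_max}

def apply_scaling(model, values):
    """Scale new values with a previously fitted model; the model is not updated."""
    scale = (model["new_max"] - model["new_min"]) / (model["max"] - model["min"])
    return [model["new_min"] + (v - model["min"]) * scale for v in values]

model = fit_min_max([10.0, 20.0, 30.0])
# 40.0 lies outside the training range, so it maps outside [0, 1].
print(apply_scaling(model, [15.0, 40.0]))
```

Because the model is frozen, new values outside the training range land outside the target range, which is why the new data is assumed to come from a similar distribution.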

Assigning new data values to given clusters

New items, assigned to cluster number and distance
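For centroid-based models such as K-means, assigning a new item amounts to finding the nearest existing cluster center and reporting the distance. The sketch below assumes Euclidean distance and a hypothetical list of centroids; density-based (DBSCAN) and SOM models would need their own assignment rules, which are not shown.

```python
import math

def assign_to_cluster(point, centroids):
    """Return (cluster index, distance) of the nearest centroid; centroids stay fixed."""
    distances = [math.dist(point, centroid) for centroid in centroids]
    cluster_id = min(range(len(centroids)), key=distances.__getitem__)
    return cluster_id, distances[cluster_id]

# Hypothetical centroids from an earlier clustering run.
centroids = [(0.0, 0.0), (10.0, 10.0)]
for new_item in [(1.0, 1.0), (9.0, 7.0)]:
    print(new_item, assign_to_cluster(new_item, centroids))
```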

Page 12: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Survival Analysis

New Algorithms

New Survival Analysis statistics procedure

– The Kaplan-Meier survival analysis procedure calculates a survival probability estimate over time.

– Besides the probability estimate, confidence intervals are returned as well.

– Two or more Kaplan-Meier survival functions can be compared for equality using a statistical hypothesis test called the log-rank test.

Use cases

– “In medical research, the Kaplan-Meier estimate is often used to measure the fraction of patients living for a certain amount of time after treatment.

– In other fields, Kaplan–Meier estimators may be used to measure the length of time people remain unemployed after a job loss, the time-to-failure of machine parts, or how long fleshy fruits remain on plants before they are removed by frugivores.”

Survival analysis plot
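A minimal sketch of the Kaplan-Meier product-limit estimate, to show what the survival curve contains. The function name and the small data set are invented for illustration; confidence intervals and the log-rank test, which the procedure also provides, are left out.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each time with an observed event.

    events[i] is 1 if the event was observed at durations[i], 0 if censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        at_risk = sum(1 for d in durations if d >= t)
        observed = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if observed:
            survival *= 1 - observed / at_risk
            curve.append((t, survival))
    return curve

# Hypothetical follow-up times; 0 marks a censored observation.
durations = [1, 2, 2, 3, 4, 5]
events = [1, 1, 0, 1, 0, 1]
for time, probability in kaplan_meier(durations, events):
    print(time, round(probability, 4))
```

Censored observations never lower the curve themselves, but they leave the risk set, so later events weigh more heavily.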

Page 13: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Logistic Regression

Enhancements

Enhanced Logistic Regression for big data

– Logistic Regression supports stochastic gradient descent (SGD), a gradient descent optimization method for minimizing the objective function.
o Training: learn the coefficients from the training data to minimize the error.
o Optimization: run many iterations to update the coefficients until convergence.
o Standard logistic regression: each iteration requires a FULL scan of the data.
o SGD: each iteration requires only ONE or a FEW samples of the data.

Use SGD when training time is too long, e.g. when using a very large training data set.
- SGD converges much faster, but the optimization error may not be minimized as well as with standard logistic regression.

– Additionally, table-based output support for AIC statistics.
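The contrast between a full scan and a one-sample update can be seen in a short sketch of SGD for logistic regression. Everything here (function names, learning rate, epoch count, toy data) is an assumption chosen for illustration and says nothing about how PAL implements or parameterizes the method.

```python
import math
import random

def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sgd_logistic(rows, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit coefficients by updating on ONE sample at a time."""
    rng = random.Random(seed)
    weights = [0.0] * (len(rows[0]) + 1)  # last entry is the intercept
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in random order each epoch
        for i in order:
            x = list(rows[i]) + [1.0]
            error = sigmoid(sum(w * v for w, v in zip(weights, x))) - labels[i]
            # Gradient of the log loss for this single sample is error * x.
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
    return weights

def predict_proba(weights, row):
    x = list(row) + [1.0]
    return sigmoid(sum(w * v for w, v in zip(weights, x)))

rows = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [0, 0, 0, 1, 1, 1]
weights = sgd_logistic(rows, labels)
print(predict_proba(weights, (0.0,)), predict_proba(weights, (5.0,)))
```

Each update touches a single row, so the cost per step does not grow with the size of the training table; the price is a noisier path toward the minimum.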

Page 14: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements

SAP HANA Predictive Analysis Library – Time Series Algorithms

Enhancements

“Forecast Smoothing” algorithm enhancements

– Support for Mean Absolute Percent Error (MAPE), a widely used forecasting performance/optimization indicator
o Easy to interpret: the forecasting error as a percentage of the actual value
o As such free of scale, thus easier to use to compare forecasting accuracy at different scales/levels

– Support for limited-memory BFGS with simple bound constraints (L-BFGS-B) as a forecasting parameter estimation method (as an alternative to AHEAD)
o Due to its linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with a large number of variables, and may be faster too
o L-BFGS-B is offered as an alternative/additional approach to optimizing forecasting parameters and helps to achieve better overall forecasting accuracy

– Support for additional trend damping algorithm options
o DAMPED parameter option to select Holt's linear or Holt-Winters method
o When performing forecast calculations, every now and then the system discovers a strong upward trend and, as a result, creates a very optimistic forecast. In reality, strong economic upward trends, for example in sales or market share, do not last long. A damped upward trend corrects this over-forecasting problem.

– Specify the train and test data ratio for the whole time series analysis
o Evaluation results are returned based on the test data
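The pieces on this slide fit together as sketched below: split the series by a train ratio, forecast with a damped-trend variant of Holt's linear method, and score the forecast on the held-out part with MAPE. The function names, the smoothing constants and the damping factor `phi` are assumptions for this example and are not PAL parameter names; with `phi=1.0` the code reduces to undamped Holt's linear smoothing.

```python
def mape(actual, forecast):
    """Mean absolute percent error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def split_series(series, train_ratio):
    """Keep the first train_ratio share of the series for training."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]

def damped_holt_forecast(series, horizon, alpha=0.5, beta=0.3, phi=0.9):
    """Holt's linear smoothing with a trend damped by phi at every step."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        previous_level = level
        level = alpha * y + (1 - alpha) * (level + phi * trend)
        trend = beta * (level - previous_level) + (1 - beta) * phi * trend
    forecasts, damp_sum = [], 0.0
    for h in range(1, horizon + 1):
        damp_sum += phi ** h  # phi + phi^2 + ... + phi^h
        forecasts.append(level + damp_sum * trend)
    return forecasts

# Hypothetical series with an accelerating upward trend.
series = [10, 12, 15, 19, 24, 30, 37, 45, 54, 64]
train, test = split_series(series, 0.8)
damped = damped_holt_forecast(train, len(test), phi=0.9)
undamped = damped_holt_forecast(train, len(test), phi=1.0)
print(mape(test, damped), mape(test, undamped))
```

Because MAPE is a percentage of the actual value, the two scores can be compared directly even if the series were rescaled.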

Page 15: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Association Analysis

Enhancements

Enhanced Association Analysis

– Apriori and FP-growth association algorithms support an “UBIQUITOUS” parameter for filtering out highly frequent items
o Value 0-1 (default 1)

– Excluding highly frequent items increases the effectiveness of the association analysis and speeds up the overall analysis
o By eliminating common association findings from the analysis

e.g. exclude the plastic bag from the basket

e.g. excluding items with frequency greater than 0.75
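The effect of such a filter can be illustrated with a pre-processing step that drops items whose support exceeds a threshold before rule mining starts. The function name and the baskets are hypothetical, and the strict "greater than" comparison follows the example on the slide; how PAL treats the boundary value internally is not stated here.

```python
from collections import Counter

def drop_ubiquitous(transactions, ubiquitous=1.0):
    """Remove items that occur in more than `ubiquitous` (0-1) of all transactions."""
    n = len(transactions)
    counts = Counter(item for basket in transactions for item in set(basket))
    too_common = {item for item, count in counts.items() if count / n > ubiquitous}
    return [[item for item in basket if item not in too_common]
            for basket in transactions]

baskets = [
    ["plastic bag", "bread", "milk"],
    ["plastic bag", "milk"],
    ["plastic bag", "bread", "butter"],
    ["plastic bag", "milk", "butter"],
]
# "plastic bag" is in 100% of baskets and is dropped; "milk" (75%) is kept.
print(drop_ubiquitous(baskets, ubiquitous=0.75))
```

With the default of 1 nothing is removed, since no item can appear in more than all transactions.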

Page 16: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Predictive Analysis Library – Misc Enhancements

Enhancements

General enhancements

– BIGINT data type support for ID columns with various PAL functions

Enhancements to algorithms

– Parameter Selection and Model Evaluation (PSME) support for Random Forests on TRY_NUM

– Principal Component Analysis (PCA) supports a projection function based on previously generated PCA results
o Allows projecting new data onto the PCA model eigenvectors

– Distribution Fitting additionally supports maximum likelihood estimation (MLE) for Weibull distribution fitting for a mixture of left, right, and interval censored data

– Self-Organizing Maps clustering algorithm supports radius parameters and distance-to-BMU output
o More output statistics on cluster results

– Gaussian Mixture Model (GMM) clustering algorithm: optimized cluster model table output format
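Projecting new data onto a previously fitted PCA model is a centering step followed by dot products with the stored eigenvectors. The sketch below assumes the model consists of per-column means and unit-length eigenvectors and that no additional scaling is applied; the names and numbers are invented for illustration.

```python
import math

def project(rows, means, components):
    """Center each row with the stored means, then project onto each eigenvector."""
    projected = []
    for row in rows:
        centered = [value - mean for value, mean in zip(row, means)]
        projected.append([sum(c * w for c, w in zip(centered, component))
                          for component in components])
    return projected

# Hypothetical PCA model from an earlier run: column means and two eigenvectors.
s = 1 / math.sqrt(2)
means = (2.0, 2.0)
components = [(s, s), (-s, s)]
print(project([(3.0, 3.0), (1.0, 3.0)], means, components))
```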

Page 17: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA Smart Data Streaming – Predictive Functions

Integration Enhancements

Embedding of predictive algorithms for incremental classification (Adaptive Hoeffding Decision Tree) and clustering (DenStream) within SAP HANA Smart Data Streaming

Figure labels: Stream (push); connect, query; SAP HANA Platform with enhanced smart data streaming; Incoming Streams; SAP HANA Platform; Streaming Service with predictive analytics and machine learning; Devices / IoT Gateway; Streaming Lite; Real-time insights and automated decision-making

Features & Capabilities
- Incorporate current events into prediction algorithms immediately rather than periodic polling
- Valid and up-to-date prediction model is maintained at all times regardless of data drifts

Benefits
- Instantly and progressively adapt to changing conditions and behaviors
- Re-imagine business models, products and services

Scenarios & Use Cases
- In credit rating, respond to changing conditions real-time
- In product recommendation, respond to changing behaviors and patterns real-time

Page 18: What's New in SAP HANA SPS 11 Predictive


SAP HANA – Recent Enhancements

SAP HANA Application Function Modeler / Flowgraph Editor

New capabilities and Enhancements

SAP HANA Web-based Development Workbench Flowgraph Editor

– AFL function transforms are available in the web editor using a generic AFL transform.
– Use for PAL and other functions is still limited in scope and usability.

SAP HANA Studio Flowgraph Editor

– An upcoming fix revision will provide the new Predictive Analysis Library functions to the palette.

– Flowgraphs support new data sources as input, such as SQL views and calculation views using input parameters.

Page 19: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
SAP HANA and the Automated Predictive Library*

Integration Enhancements

Automated Predictive Library utility procedures

– During manual development using APL, you no longer need to define a signature and create the AFLLANG stored procedure and table types; APL generates the stored procedure automatically for you.

Recent enhancements (SAP HANA SPS10)

– Automated Predictive Library model training delegation to SAP HANA based on SAP Predictive Analytics 2.4 and SAP HANA SPS10

*Note: The Automated Predictive Library is not an included license component of the SAP HANA Platform.

Page 20: What's New in SAP HANA SPS 11 Predictive


SAP HANA SPS11 – New Features and Enhancements
R Integration with SAP HANA

Enhancements

Platform enhancements

– R Integration support for SAP HANA on Power

Recent enhancements (SAP HANA SPS10)

– Support for an SSL-encrypted communication channel between SAP HANA and the Rserve environment (SSL/TLS from IIRC version 1.7)

– Support for multiple R server connections
o Enables round-robin distribution/load balancing of connection calls, e.g. in case there is a bottleneck with one R server

– Support for SAP HANA design-time hdbprocedures of type RLANG
o Enables RLANG procedures to be better integrated into the overall SAP application lifecycle management

– SAP HANA and R version compatibility and support
o Updated support documented in SAP Note 2185029

Figure labels: SAP HANA; R-call; R-Server

Page 21: What's New in SAP HANA SPS 11 Predictive


How to find SAP HANA documentation on this topic?
SAP HANA Predictive Analysis Library

SAP HANA Platform (Core)

What’s New – Release Notes

Installation
– SAP HANA Server Installation Guide

Administration
– SAP HANA Administration Guide

Development
– SAP HANA Developer Guide

References
– SAP HANA Predictive Analysis Library (PAL) Reference

• In addition to this learning material, you can find SAP HANA documentation on SAP Help Portal knowledge center at http://help.sap.com/hana_platform.

• The knowledge center is structured according to the product lifecycle: installation, security, administration, development. You can find e.g. the SAP HANA Predictive Analysis Library (PAL) Reference in the References section.

Page 22: What's New in SAP HANA SPS 11 Predictive


How to find SAP HANA documentation on this topic?
SAP HANA Studio Application Function Modeler

• In addition to this learning material, you can find SAP HANA documentation on the SAP Help Portal knowledge center at http://help.sap.com/hana_platform.

• The knowledge center is structured according to the product lifecycle: installation > security > administration > modeling > development. So you can find e.g. the SAP HANA Developer Guide in the Development section and so forth …

Page 23: What's New in SAP HANA SPS 11 Predictive


How to find SAP HANA demo examples on this topic?

• Go online to https://www.youtube.com/user/saphanaacademy.

Page 24: What's New in SAP HANA SPS 11 Predictive


Thank you

Contact information

Christoph Morgen
SAP HANA Platform Product [email protected]

Xingtian Shi
SAP Products and Innovation
Data Science Development