nrb - luxembourg mainframe day 2017 - data spark and the data federation

© 2017 IBM Corporation

Data: Spark and the Data Federation

Leif Pedersen

Executive IT Specialist,

z Analytics, Europe

Email: [email protected]


Systems of Insight · Systems of Record · Systems of Engagement

Look like a “déjà vu”?



In the new insight economy, winners infuse analytics everywhere to drive better outcomes!

Create new business models (CEO)

Attract, grow, retain customers (CMO)

Transform financial & management processes (CFO)

Manage risk (CRO)

Prioritize IT investment for innovation (CIO, CDO)

Optimize operations (COO)

Fight fraud and counter threats (CSO)

Systems of Insight · Systems of Record · Systems of Engagement



IBM Analytics Point of View - Make DATA SIMPLE and ACCESSIBLE to ALL

All Data · New Dev Styles · New Analytics · More People

Business Value

• Embrace all data

• Enable all analytics

• Run at the speed of business

DATA Professionals are leading THE Transformation!


The Evolution in the Approach to Getting Value from Data

Operations · Data Warehousing · Self-service Analytics · New Business Imperatives

[Chart axes: Maturity and Value, each running from Low to High]

Data-Informed Decision Making

• Full dataset analysis (no more sampling)

• Extract value from non-relational data

• 360° view of all enterprise data

• Exploratory analysis and discovery

Warehouse Modernization

• Data lake

• Data offload

• ETL offload

• Queryable archive and staging

Lower the Cost of Storage

Ensure resiliency and availability

Business Transformation

• Create new business models

• Risk-aware decision making

• Fight fraud and counter threats

• Optimize operations

• Attract, grow, retain customers

Value

We are here



Analytics evolution to support all Analytics Apps on all Data - The Mainframe Use case

[Diagram, with columns Applications and Data:]

SoE (Systems of Engagement)

SoI (Systems of Insight): BI Reporting · Data Warehouse / Data Marts · Historical data in DB2 for z/OS & IBM DB2 Analytics Accelerator · Other Data

The Data Lake Evolution: HDFS · Map / Reduce · Spark

The Predictive Analytics Evolution: Machine Learning · Score Creation

SoR (Systems of Record): Core Business supported by CICS, IMS, WAS · Operational Data stored in VSAM, IMS, DB2 · z/OS · Rules · Score execution

IT Operational Data


z Systems Analytics Areas complement existing Analytics Environments.

IBM DB2 Analytics Accelerator

In-transaction rules and score execution

Intraday capability for ad-hoc queries & predictive analytics

Availability of historical data (in raw format)

Accelerated reporting to fulfill internal and regulatory requirements

Ability to transform data before offload to DWH or reporting

Ability to create new models at any time

Quasi-real-time availability of data for analytics

Instant access to raw data for new report generation in hours instead of days

Load and merge of ANY non-DB2 z/OS data

Explore data to uncover hidden insights

[Diagram labels: zData · zApps · Scoring · Rules]


The integration of transactions and analytics is an emerging and important market segment

“Hybrid transaction/analytical processing will empower application leaders to innovate via greater situation awareness and improved business agility.”

Gartner Research Note G00259033, 28 January 2014: Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation

• Opportunity to rethink business processes: analytics as an integral part of the process itself, rather than a separate activity performed after the fact

– Transform business processes, not just provide existing styles of analytics faster and without latency

• Enable business leaders to perform, in the context of operational processes, advanced and sophisticated real-time analysis of their business data

Analytics as part of the flow of business

Insights on every transaction


Hybrid Transaction/Analytical Processing (HTAP) - with DB2 Analytics Accelerator

DB2 for z/OS CPU savings target

• Operational (in transaction) analytics

• (complex) OLTP

Accelerator focus

• Ad-hoc queries

• Complex queries scanning large amounts of data

• ETL acceleration/virtual transformation

[Diagram labels: DB2 for z/OS Processing · IBM DB2 Analytics Accelerator · OLTP Transactions · High concurrency · Standard reports · Complex queries (more history) · OLAP · Hybrid Transactional & Analytical Processing]


Data Warehouse and Data Lake

A Data Lake is…

+ An analytics sandbox for exploring data to gain insight

+ An enterprise-wide catalog to find data across the enterprise and to link from business terms to technical metadata

+ An environment for enabling reuse of data transformations and queries

+ An environment where users can access vast amounts of raw data

+ An environment for developing and proving an analytics model and then moving it into production; experience in production may drive further experimentation in the data lake

A Data Lake is not…

- A data warehouse or data mart of all of the data in an enterprise

- A high-performance production environment

- A production reporting application

- A purpose-built system to solve a specific problem



• Fast Runtime Environment

– Interactive or batch processing

– Based on in-memory data processing: high performance for multi-step processes where Spark can pass the data directly without using disk storage

– Parallel processing

• Interface to Data

– Accessing Hadoop-based HDFS data, Cassandra, HBase, …

– Accessing any traditional database using JDBC

• Interface for Applications

– Ease-of-use APIs supported by modern languages

– Stack of libraries including SQL, Machine Learning, GraphX, and Spark Streaming

– Over 80 high-level operators that make it easy to build parallel applications

– Many languages supported including Java, Scala, Python and R (Spark itself is written in Scala)

Spark, a Transaction Manager for Analytics Applications


Spark is NOT a datastore, NOT a replacement for Hadoop!


1. Spark makes it easier to access and work with all data

2. Spark lets you develop line-of-business applications faster

3. Spark learns from data and delivers in real time

With Hadoop, you ask a question and get back a batch of data. With Spark, you may say, “continue to give me answers to this question” … and when new data comes, the user is smarter.

- Enables new data-based use cases

- All data: Internal/External, Structured/Unstructured

- Real-time insights, from all data sources

- Automates analytics with Machine Learning

- Clients that lead in data, lead their industry

Design · Development · Data Science

Why does Spark matter to a business?
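The “continue to give me answers to this question” idea on this slide can be illustrated without Spark at all. The sketch below is plain Python, not Spark Streaming; the function name `running_average` and the sample amounts are hypothetical. It only shows the shape of a standing query whose answer is refreshed each time a new record arrives, instead of being recomputed over a finished batch.

```python
from typing import Iterable, Iterator


def running_average(amounts: Iterable[float]) -> Iterator[float]:
    """Yield a refreshed average every time a new value arrives."""
    count = 0
    total = 0.0
    for amount in amounts:
        count += 1
        total += amount
        # The "question" stays the same; the answer is updated per record.
        yield total / count


# Hypothetical stream of transaction amounts arriving one at a time.
answers = list(running_average([10.0, 20.0, 60.0]))
print(answers)
```

In a real deployment the input would be an unbounded stream rather than a list; the incremental-update pattern is the part this sketch is meant to convey.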



[Diagram: IBM z/OS Platform for Apache Spark]

Apache Spark Core: Spark Streaming · Spark SQL · MLlib · GraphX (RDD, DF)

Optimized data access

z/OS: Key Business Transaction & Batch Systems · VSAM · Adabas · IMS · DB2 z/OS

Distributed: Teradata · HDFS · and *many* more . . .

Spark Applications: IBM and Partners

Spark can run on z/OS close to z/OS-based Applications & Data

Values: Data-in-place analytics, without need to ETL or move data for analytic purposes

Optimized access and z/OS governed ‘in-memory’ capabilities for core business data

Unique capability to access almost all z/OS sources with Apache Spark SQL & many non-z data sources

Almost all zIIP eligible

Integration of analytics across core systems, social data, website information, etc.


and *many* more including SMF, OPERLOG, SYSLOGs, . . .


Examples of Spark Use Cases


• Client Insight: Analytics over transactions & customer interactions

– Leverage data on z/OS (DB2, VSAM) & distributed (Oracle, SQL Server, HDFS) to enable real-time access from data science teams focused on client insight to develop patterns, models

• Data Distillation - Hybrid Architecture

– Run Spark z/OS to access, aggregate, filter and *distill* large volumes of data

– Make available smaller, aggregated analytic results for access by: customer insight solutions, data science environments

• 360 Degree View: Customers, Payments, Transactions

– Leverage Spark z/OS to get real-time or near real-time view of current status of payments, transactions, customers combining data from OLTP, distributed sources, & streaming

• IT Analytics

– Analyze real-time streamed SMF data, combined with archived SMF data and syslog data; visualize and interact with the data in a data science Jupyter Notebook to find patterns

Use Case Patterns


Distill the Data:

• Use Spark z/OS for data blending, cleansing, transform, etc. with data-in-place

• Store results in ‘Tidy’ Data Repository

• Refresh as needed

Explore the results:

• Data exploration, investigation leveraging ‘Tidy’ Repository

Values:

• Leverage most current business data for data science

• Efficiencies in reducing ETL

• Leverage common analytics ecosystem skill

• Integrate Spark on multiple platforms for optimal analytics infrastructure

Use Case #1: Hybrid Data Science
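The distill step of this use case (cleanse and aggregate where the data lives, then keep only a small tidy result) can be sketched in a few lines. This is plain Python rather than Spark, and the record layout (`customer`, `amount`), the function name `distill`, and the sample rows are all assumptions made for illustration; the slides do not specify a schema.

```python
from collections import defaultdict


def distill(transactions):
    """Cleanse raw rows and aggregate them into one tidy row per customer."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for row in transactions:
        # Cleansing: normalise the key and drop rows missing a key or amount.
        customer = (row.get("customer") or "").strip().upper()
        amount = row.get("amount")
        if not customer or amount is None:
            continue
        totals[customer]["count"] += 1
        totals[customer]["amount"] += float(amount)
    # The tidy result is far smaller than the raw input.
    return [
        {"customer": c, "tx_count": v["count"], "total_amount": round(v["amount"], 2)}
        for c, v in sorted(totals.items())
    ]


# Hypothetical raw rows, including two that fail cleansing.
raw = [
    {"customer": " c001 ", "amount": "12.50"},
    {"customer": "C001", "amount": 7.5},
    {"customer": "c002", "amount": 3},
    {"customer": "", "amount": 99},
    {"customer": "c003", "amount": None},
]
tidy = distill(raw)
print(tidy)
```

Only `tidy` would be stored in the ‘Tidy’ repository and refreshed as needed; the raw rows never leave the source system.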


Use Case #2: Optimized Customer Insight

[Diagram: Customer Insight for Banking Solution fed from z/OS]

z/OS sources: Customer · Transaction · Merchant · Call Center

IBM z/OS Platform for Apache Spark: Apache Spark Core · Spark Streaming · Spark SQL · MLlib · GraphX (RDD, DF) · Optimized Data Layer

Spark Analytic Result Set: subset of data (distilled, filtered, transformed), i.e. tidy data

Transform (if needed), & populate BBCI staging area / cache

Solution components: BI Dashboard · Data Cube · Analytical Engines · Web Portal · Analytics · API Gateway · APIs · Pre-Built Dashboards · Pre-Built Data Models · Pre-Built Analytical Models · Input & Output

Values:

• Avoid costly and ineffective wholesale copy of data

• Frequent refresh of most relevant data elements to customer insights solution

• Faster time to implementation for business solution to deliver insights on churn, cross-sell, etc.

Customer Insight for Banking Solution


Use Case #3: Real-Time Application Event Analytics

Spark z/OS

Event Stream

• CICS Event triggers create an event stream that would be captured by Spark running on its own z/OS LPAR

– Spark configured for high availability to avoid impacting CICS

• Real-Time Analytics with Spark z/OS:

– Real-time analytics to provide feedback into the Systems of Engagement or Monitoring Systems on types of banking services and frequency of consumption

– Real-time monitoring of core business processes and applications

• Historical Analysis leverages IDAA:

– Batch Load of Events for historical, trending and reporting
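The real-time part of this use case, tracking which banking services are consumed and how often, amounts to counting events per service type over a recent time window. The sketch below is plain Python rather than Spark Streaming; the class name `ServiceFrequency`, the 60-second window, and the sample events are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class ServiceFrequency:
    """Count events per service type over a sliding time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, service) in arrival order
        self.counts = Counter()  # service -> events still inside the window

    def add(self, timestamp, service):
        self.events.append((timestamp, service))
        self.counts[service] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window ending at `now`.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_service = self.events.popleft()
            self.counts[old_service] -= 1
            if self.counts[old_service] == 0:
                del self.counts[old_service]

    def snapshot(self):
        return dict(self.counts)


# Hypothetical event stream: (seconds since start, banking service used).
freq = ServiceFrequency(window_seconds=60)
for ts, svc in [(0, "transfer"), (10, "balance"), (30, "transfer"), (65, "balance")]:
    freq.add(ts, svc)
result = freq.snapshot()
print(result)
```

A snapshot like this is what would be fed back to a System of Engagement or a monitor; the overnight batch load into the accelerator covers the history that the window discards.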

[Diagram labels: Channel · System of Engagement · CICS Transactions · Monitor · Logstream · Real-Time Consumption · Real-Time Analytics (can include scoring) · Batch Load Overnight · DB2 Analytics Accelerator Loader · IBM DB2 Analytics Accelerator · Historical Analysis, Reporting · DB2 z/OS]


Use Case #4: Surface Spark Results to JDBC / ODBC Applications

[Diagram: Spark on z/OS surfacing results to client applications]

Apache Spark Core: Spark Streaming · Spark SQL · MLlib · GraphX (DF, RDD)

DF Store:

• Persist specific Spark Result Sets

• Backed by VSAM

• Leverage z/OS SAF, Dataset mgmt

Optimized Data Layer: DB2 z/OS · IMS · VSAM · HDFS

JDBC / ODBC / REST, noSQL client accessing Spark RDDs, example: Cognos, Tableau, …


Use Case #5: Analyzing SMF Data with Spark

• Spark application is agnostic to data source and number of sources

• MDSS required on at least one system, MDSS agents required on all systems. No IPL required for installation

• Logstream recording mode required for real-time interfaces

[Diagram: Spark Application using SparkSQL, connected over JDBC to the Optimized Data Integration Layer (MDSS). Each of LPAR1, LPAR2, LPAR3, … LPARn is shown with an MDSS Client and SMF Realtime Logstreams; Dump Data Sets are shown as a further source.]

• Analyze real-time in-memory SMF data, combined with archived data

• Analyze data across multiple LPARs

• Augment with SYSLOG and other sources for richer analytic outcome

• Efficiencies in avoiding data movement


Use Cases for Real Time SMF Analytics

• Detect excessive memory consumption - SMF 30

Monitor high water mark for real memory usage for jobs and send alerts if usage exceeds normal consumption

• Detect security violations in real time - SMF 80

Monitor volume of datasets/files accessed per user within a given time period and raise alerts for above-normal access rates

• Real-time monitoring of resource usage in cloud environments (CPU, Memory, Disk)
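The SMF 30 use case above hinges on a definition of “normal consumption”, which the slides leave open. One simple possibility, shown here as an assumption rather than as the method used by any IBM product, is to flag a job when its current memory high-water mark exceeds the mean of its past values by more than a few standard deviations. The function name `memory_alerts`, the `sigmas` parameter, the job names, and the figures are all hypothetical, and parsing of real SMF 30 records is out of scope.

```python
from statistics import mean, stdev


def memory_alerts(history_mb, current_mb, sigmas=3.0):
    """Return jobs whose current high-water mark exceeds mean + sigmas * stdev."""
    alerts = []
    for job, observed in sorted(current_mb.items()):
        past = history_mb.get(job, [])
        if len(past) < 2:
            continue  # not enough history to define "normal" for this job
        threshold = mean(past) + sigmas * stdev(past)
        if observed > threshold:
            alerts.append(job)
    return alerts


# Hypothetical high-water marks in MB, per job name.
history = {
    "PAYROLL": [100, 110, 90, 105, 95],
    "BILLING": [200, 210, 190, 205, 195],
}
current = {"PAYROLL": 300, "BILLING": 204, "NEWJOB": 999}
alerts = memory_alerts(history, current)
print(alerts)
```

Jobs with no history are skipped rather than flagged; whether a brand-new job should raise an alert is a policy decision this sketch does not make.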

A list of supported SMF record types can be found in the Redbook “Apache Spark Implementation on IBM z/OS” - page 78

http://www.redbooks.ibm.com/abstracts/sg248325.html
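The SMF 80 use case listed above (datasets accessed per user within a time period) can be sketched as a count of distinct datasets per user per fixed window. This is plain Python under stated assumptions: events are taken to be already decoded into `(timestamp, user, dataset)` tuples, the windows are fixed-size (tumbling) rather than sliding, and the function name `excessive_access`, the limit, and the sample data are hypothetical.

```python
from collections import defaultdict


def excessive_access(events, window_seconds, max_datasets):
    """Return (user, window_start) pairs where a user touched too many datasets."""
    seen = defaultdict(set)  # (user, window index) -> distinct datasets accessed
    for timestamp, user, dataset in events:
        seen[(user, timestamp // window_seconds)].add(dataset)
    return sorted(
        (user, window * window_seconds)
        for (user, window), datasets in seen.items()
        if len(datasets) > max_datasets
    )


# Hypothetical access events: (seconds since start, user ID, dataset name).
events = [
    (5, "ALICE", "PAY.MASTER"),
    (20, "ALICE", "PAY.MASTER"),
    (30, "BOB", "HR.A"),
    (40, "BOB", "HR.B"),
    (50, "BOB", "HR.C"),
    (70, "BOB", "HR.A"),
]
flagged = excessive_access(events, window_seconds=60, max_datasets=2)
print(flagged)
```

Repeated access to the same dataset counts once per window, so the check measures breadth of access rather than raw request volume.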


IBM Open Data Analytics for z/OS


Leveraging IBM Z for Optimized Analytics

Federate analytics leveraging data in place for more current insights at scale, optimized security, privacy and reduced costs

[Diagram labels:]

Business Applications: Customer · Transaction · Merchant

Distributed · Apache Spark · Python · Query Acceleration

Distilled Insight · Analytic Result Sets

Data → Data Prep → ML Algo → Model → Deploy → Predict

Govern, Manage, Algorithm Assist… · Monitor, Feedback

Pauseless GC · New SIMD instructions · 32 TB Memory · Pervasive Encryption



IBM Machine Learning for z/OS

Optimized Data Integration Layer


IBM Open Data Analytics for z/OS: Offering Overview

What is in the Offering?

IBM Open Data Analytics for z/OS (IBM product):

• Apache Spark 2.1.1 enabled for z/OS

• Python 3.6.1

• All pre-requisite libraries

• Select Anaconda libraries (approx. 250 including pandas, dask, numpy, scikit-learn, matplotlib…)

• Optimized Data Integration Layer: optimized for Spark & Python db access to z/OS data

• Integration with WLM z/OS for resource management aligned with job priority

• Integration with security (SAF) interfaces

• Support & Service available from IBM for a fee

– Very aggressive pricing for zIIPs (cores) and memory for Open Data Analytics z/OS workload

Ecosystem

– GitHub zos-spark repository

• Jupyter Notebooks (Scala, Python Workbenches)

• Kernel gateway, Jupyter client, Toree kernel

• Sample data & code snippets

– Rocket:

• Collaboration for Optimized Data Layer

• Industry vertical mappings, e.g. ISO8583-1, ACH, SMF, etc.

– Continuum:

• Access to z/OS channel on Anaconda cloud for updates / refreshes & package management

• Option to license private mirrored environment

• Services & Consulting for Python


Value: Increase Integration through Persisting Analytic Results for Enterprise Collaboration

[Diagram: Spark and Python on z/OS persisting results for client access]

Apache Spark Core: Spark Streaming · Spark SQL · MLlib · GraphX (DF)

Python 3.6.1 Core Packages: numpy · scikit-learn · dask · pandas · Matplotlib · etc.

DF Store:

• Specific Spark & Python Result Sets

• Backed by VSAM

• Leverage z/OS SAF, Dataset mgmt

Optimized Data Layer: VSAM · IMS · DB2 z/OS · HDFS

JDBC / ODBC / REST, noSQL client accessing Spark RDDs, example: Cognos, Tableau, …

