zSpotlight: Spark on z/OS (MWDUG)

Avijit Chatterjee, Ph.D., STSM, IBM Competitive Project Office
[email protected], @ChatterAvijit

© 2016 IBM Corporation



Page 2

CEOs are increasingly focused on customers as individuals, leveraging contextual mobile and cognitive technologies

Source: 2015 IBM Global C-Suite Study: The CEO Perspective

Page 3

To focus on customers as individuals, real-time analytics is heavily leveraged across many industries

Government
• Compliance: Score to detect non-compliant behavior and tax evasion
• Social Services: Assess the likelihood that an individual will need support from multiple agencies, so that those agencies can be engaged proactively to create the best outcome and manage costs

Banking
• Card: Use scoring to determine transaction risk based on spending history
• Money laundering risk: Score based on money wired to multiple accounts with each amount kept below the threshold

Retail
• Sales opportunity: Real-time scoring for target marketing

Page 4

Running Analytics on enterprise data off-platform doesn’t pay for a mainframe-centric business…

[Diagram: Operational Data feeding five Analytical Data copies]

A large Asian bank:
• One mainframe devoted exclusively to bulk data transfers
• ETL consuming 8% of total distributed cores and 18% of total MIPS

A large European bank:
• 120 database images created from bulk data transfers
• 1,000 applications on 750 cores with 14,000 software titles
• ETL consuming 28% of total distributed cores and 16% of total MIPS

With this strategy, IT costs grow faster than the business.

Source: IBM Eagle Studies

Page 5

… Rather, it leads to significant data transfer costs

[Diagram: three Analytical Data copies on x86; 1 TB of data transferred per day]

Source: IBM CPO internal study

Assuming 4 cores on z13 running at 85% utilization and 12 cores on x86 servers running at 45% utilization, the transfer will burn 519 MIPS and use 10 x86 cores per day

Estimated 4 yr. cost summary

System costs = $9,864,412

Labor costs = $393,927

Total = $10,258,339
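
As a quick arithmetic check on the cost summary above, the following minimal Python sketch adds the two cost lines and shows how much of the total is system cost. The variable names are illustrative, and the per-year average assumes the cost is spread evenly over the four years, which the slide does not state.

```python
# Figures copied from the slide's estimated 4-year cost summary (USD).
system_costs = 9_864_412
labor_costs = 393_927

# The slide's total should be the sum of the two lines.
total = system_costs + labor_costs

# Fraction of the total attributable to system costs.
system_share = system_costs / total

# Simple average per year; assumes an even spread (an assumption, not from the slide).
per_year = total / 4

print(f"Total 4-year cost: ${total:,}")
print(f"System share of total: {system_share:.1%}")
print(f"Average per year: ${per_year:,.0f}")
```

The sum reproduces the slide's total, and system costs rather than labor dominate the estimate.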

[Diagram labels: Example, ODS, Operational Data]

ETL Calculator: http://centerlinebeta.net/5023d/

Page 6

Apache Spark powers a federated Analytics architecture keeping data in place

• Unified analytics platform for structured and unstructured data
• Flexibility & agility with multi-language support
• Efficient structure – up to 100x faster than MapReduce
• Rich set of built-in functions with consistent APIs: Spark SQL, Spark MLlib, Spark Streaming, GraphX, …

[Diagram: Spark instances running on Linux on z Systems, Power, x86, Cloud, and z/OS (alongside CICS, WAS, and IMS), with access to DB2, DB2 z/OS, VSAM, IMS DB, Physical Sequential, SMF, Log Streams, ADABAS, Syslog, and Tape]

Page 7

Apache Spark empowers users to accelerate the insight economy

Persona nicknames shown on the slide: “the convincer”, “the builder”, “the thinker”

Data Scientist

What they want to do:
• Identify patterns, trends, risks, and opportunities in data
• Discover new actionable insights

How Spark can help:
• Supports the entire data science workflow: from data access and integration, to machine learning, to visualization
• Provides a growing library of machine learning algorithms

Data Engineer

What they want to do:
• Bridge between the Data Scientist and the App Developer
• Implement machine learning algorithms at scale
• Put the right data system to work for the job at hand

How Spark can help:
• Abstracts data access complexity (Spark doesn’t care what your data store is)
• Enables solutions at web-scale

App Developer

What they want to do:
• Build applications that leverage advanced analytics in partnership with the data scientist and data engineer
• Follow agile design methodologies
• Optimize performance and meet SLAs

How Spark can help:
• Supports the top analytics app languages such as Python and Scala
• Eliminates programming complexity with libraries such as MLlib and simplifies DevOps
• Makes it easy to embed advanced analytics into applications

Page 8

Spark SQL:
• Provides capability to perform relational queries via SQL (subset of HiveQL)
• Mix SQL queries with Spark applications

Spark Streaming:
• Enables scalable, high-throughput processing of live data streams
• Live stream ‘chopped’ into batches based on time window

Spark MLlib:
• Provides a scalable machine learning library with common machine learning functions
• Provides classification, regression, clustering, collaborative filtering, etc.

Spark GraphX:
• Spark APIs for graph-style processing and iterative graph computations

Spark Core:
• Foundation providing task dispatching, scheduling, I/O
• Representation of Spark’s basic unit of data: RDD

Source: http://spark.apache.org

Apache Spark is an operating system for Analytics
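
The Spark Streaming description on this slide (a live stream "chopped" into batches based on a time window) can be illustrated with a minimal plain-Python sketch. This is not the Spark Streaming API; the function name `chop_into_batches` and the sample events are hypothetical, and the sketch only shows the windowing idea.

```python
from collections import defaultdict


def chop_into_batches(events, window_seconds):
    """Group (timestamp, value) events into consecutive fixed-size time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        # Events whose timestamps fall in the same window share an index.
        window_index = int(timestamp // window_seconds)
        batches[window_index].append(value)
    # Return batches in time order, keyed by each window's start time.
    return [(idx * window_seconds, batches[idx]) for idx in sorted(batches)]


# Hypothetical stream: (seconds since start, transaction amount).
stream = [(0.2, 10), (0.9, 25), (1.1, 5), (2.7, 40), (2.8, 15)]

for window_start, values in chop_into_batches(stream, window_seconds=1):
    # Each batch can then be processed like a small, ordinary batch job.
    print(window_start, len(values), sum(values))
```

Treating each window as an ordinary batch is what lets the same batch-style operations be reused on live data.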

Page 9

Spark on z/OS joins multiple data types for fast, complete analytics, without moving the data

• 1 Scala query using SQL syntax to access 3 data sources (over 1 billion rows of data) in 3 different formats
• 1.1 billion rows of source data
• Filtered to 50M rows of Trades for brokers in 1 region
• Summarized to show activity for each broker across the big 3 exchanges
• Completed in < 2 minutes with 1 GP (60% utilized) and 11 zIIPs on a z13 LPAR with 512 GB memory

[Diagram: NYSE, Nasdaq, and S&P data held on z/OS in DB2 z/OS, a flat file, and VSAM, each accessed through JDBC/AZK]

Use Case: Filtered data pull from 1 Billion rows, using Spark filtering to access 50 Million rows, and then summarize using Spark aggregation
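
The slide does not show the query itself. As a rough sketch of its shape (merge three sources, filter to one region, then summarize per broker), the example below uses Python's built-in `sqlite3` module rather than Spark. The table names, column names, and sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Three hypothetical trade tables standing in for the three data sources.
for table in ("nyse_trades", "nasdaq_trades", "sp_trades"):
    cur.execute(f"CREATE TABLE {table} (broker TEXT, region TEXT, amount REAL)")

cur.executemany("INSERT INTO nyse_trades VALUES (?, ?, ?)",
                [("b1", "east", 100.0), ("b2", "west", 50.0)])
cur.executemany("INSERT INTO nasdaq_trades VALUES (?, ?, ?)",
                [("b1", "east", 20.0), ("b3", "east", 70.0)])
cur.executemany("INSERT INTO sp_trades VALUES (?, ?, ?)",
                [("b3", "east", 30.0), ("b2", "west", 10.0)])

# One SQL statement: merge the sources, filter by region, aggregate per broker.
query = """
SELECT broker, COUNT(*) AS trades, SUM(amount) AS total_amount
FROM (
    SELECT * FROM nyse_trades
    UNION ALL
    SELECT * FROM nasdaq_trades
    UNION ALL
    SELECT * FROM sp_trades
)
WHERE region = ?
GROUP BY broker
ORDER BY broker
"""
rows = cur.execute(query, ("east",)).fetchall()
for broker, trades, total_amount in rows:
    print(broker, trades, total_amount)

conn.close()
```

In the run described on the slide, the three sources stay in their original formats and are reached through JDBC/AZK. The in-memory tables here only stand in for that access layer.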

Resource specs for run: 6 executors, 6 GB driver memory, 80 GB per executor, 4 vCores per executor (zIIPs in SMT-2), 12 concurrent threads, 512 GB total memory in LPAR, 13 zIIPs, 2 GPs. AZK accessing all data, Spark unionAll to merge data. 55% of 1 GP utilized during peak resource consumption.

On-platform Operational Analytics using Spark on z/OS
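
As a sanity check on the resource specs listed for this run, a minimal Python sketch can total the memory and vCores requested by Spark and compare the memory against the LPAR size. The variable names are illustrative, and the calculation ignores JVM and operating system overhead, which the slide does not quantify.

```python
# Figures from the slide's resource specs for the run.
executors = 6
executor_mem_gb = 80
driver_mem_gb = 6
vcores_per_executor = 4
lpar_mem_gb = 512

# Memory requested by Spark: all executors plus the driver.
spark_mem_gb = executors * executor_mem_gb + driver_mem_gb

# Memory left in the LPAR for everything else (overheads not modeled here).
headroom_gb = lpar_mem_gb - spark_mem_gb

# Total vCores requested across executors.
total_vcores = executors * vcores_per_executor

print(f"Spark memory requested: {spark_mem_gb} GB")
print(f"LPAR headroom: {headroom_gb} GB")
print(f"Executor vCores requested: {total_vcores}")
```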

Page 10

On-platform Operational Analytics achieves 67% lower TCA

Brokerage aggregation query workload across Trades tables from 3 exchanges (over 5 Billion trades, 500GB)

[Diagram label: Trade, 166GB]

* 3-Year TCA includes 3-year US prices for Hardware, Software, Maintenance and Support as of 05/16/2016. Price and performance for x86 environment includes cost of ETL and elapsed time to transfer the data. This is based on an IBM internal study designed to replicate a typical IBM customer workload usage in the marketplace.
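
The 67% figure on this slide can be reproduced from its two 3-year TCA totals, $2,105,990 and $697,106. The minimal Python sketch below reads the larger figure as the off-platform x86 configuration (including ETL) and the smaller as the on-platform z13 configuration. That reading is an interpretation of the slide's title and layout, and the variable names are illustrative.

```python
# 3-year TCA figures from the slide (USD).
x86_tca = 2_105_990   # read as: off-platform x86 configuration, including ETL
z_tca = 697_106       # read as: on-platform Spark on z/OS configuration

# Absolute and relative difference between the two configurations.
savings = x86_tca - z_tca
reduction = savings / x86_tca

print(f"Savings over 3 years: ${savings:,}")
print(f"TCA reduction: {reduction:.0%}")
```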

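Systems compared:
• On-platform: z13-606 + 11 zIIPs running z/OS with CICS, DB2, and Apache Spark: $697,106 (3 yr. TCA)
• Off-platform: z13-605 running z/OS with CICS and DB2, plus ETL to a competitor x86 system (Intel E5-2697 v2 2.7GHz 12co) running Linux with Apache Spark and Parquet: $2,105,990 (3 yr. TCA)

67% lower TCA* for systems compared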

Page 11

It doesn’t pay to move data to x86 to run Analytics beyond 150GB

[Chart: Comparing 3-year TCA factoring Cost of ETL, Spark on z/OS vs. Spark on x86; x-axis from 100GB to 900GB of data, y-axis from $0 to $4,000,000]

Page 12

Spark on z/OS offering is now generally available

What is in the Offering?

IBM z/OS Platform for Apache Spark (IBM product):
• Apache Spark enabled for z/OS
• Optimized Data Integration Layer
• No License Charge product
• Support & Service available from IBM
– Very aggressive pricing for zIIPs and memory for Spark z/OS workload
– Lab Services – install, config, tune
– Jumpstart service for data science use cases
– DataFactZ – quick strike POCs for Spark analytic business applications

Ecosystem
– GitHub z/OS-Spark repository
  • Jupyter Notebook IDEs (Scala Workbench, Interactive Insights Workbench)
  • Apache Job Server
  • Sample data & code snippets
– Rocket:
  • Industry vertical mappings, e.g. ISO8583-1 for card data
  • In progress: “R” support
– DataFactZ:
  • Custom Solutions for banking & insurance
– Zementis:
  • Fast, Scalable, In-Transaction Predictive Scoring integration with Apache Spark

Page 13

Common use cases of Spark on z/OS

• Analytics across OLTP & Warehouse information
• Analytics combining business-owned data and external IoT and social data
• Analytics to improve system performance and operations in real-time using streaming as well as archived data

Page 14

Why Spark on z/OS is best fit

• Optimized and parallel access to almost all z/OS data environments and distributed data sources, analyzing data in place
• Spark memory structures with sensitive data are governed with z/OS security capabilities
• Leverages z/OS memory management, compression, and RDMA communications to provide a high-performance scale-up and scale-out architecture
• Uses large pages, incorporating DRAM with large amounts of Flash as an attractive means to provide scalable elastic memory
• Best fit analytic capability for the investments made in SMF in-memory analytics
• Leverages zEDC compression when compressing internal data for caching and shuffling
• SMT2 and SIMD on select operations for enhanced performance
• Very high zIIP offload, for affordability
• Intra-SQL and intra-partition parallelism for optimal data access

Page 15

Seize the moment with Analytics on z Systems – adopt or be disrupted

• Enhance customer experience by providing the right product at the right place and right time
• Locate analytics exactly where your data resides
• Power real-time analytics with the in-memory analytics engine Spark on z/OS

Get started today: ibm.biz/zAnalytics

Page 16

IBM z Systems CPO Resources

IBM Competitive Project Office
Our mission is to perform hands-on competitive research that will enable sales and technical professionals to boost market awareness and increase pipeline and sales of IBM z Systems hardware and IBM software. Please visit our website for additional information on how we can help you generate leads and close business.

Customer Briefing
Bringing to life the results of IBM's lab-based, hands-on competitive research with your clients in face-to-face events.
• "IBM z Systems – More Potential than Ever"
• "IBM LinuxONE - Enabling IT to Solve Business Challenges" For First in Enterprise (FIE)

IBM CPO Competitive Sales Assists (CSA)
A CPO CSA helps to make the advantages of IBM's solutions more real and compelling to a customer evaluating competitive alternatives. Visit our website to learn how CPO provides one-to-one client engagements to assist sellers with closing competitive opportunities.

IBM CPO Communication
• Share the external z Systems website with your clients.
• Subscribe to IBM CPO Communications
• IBM Redbooks® Point-of-View and Redpaper publications – Authored by IBM CPO
• Read more about our research - CPO Technical Deliverables
• Join the CPO Community

For information on z Systems CPO Program or Events: Sally Touscany – [email protected]