zspotlight: spark on z/os - mwdug - spark on zos.pdf · zspotlight: spark on z/os avijit...
TRANSCRIPT
© 2016 IBM Corporation
Competitive Project Office
zSpotlight: Spark on z/OS
Avijit Chatterjee, [email protected], @ChatterAvijit
STSM, IBM Competitive Project Office
CEOs are increasingly focused on customers as individuals…
2015 IBM Global C-Suite Study: The CEO Perspective
…leveraging contextual mobile and cognitive technologies
To focus on customers as individuals, real-time analytics is heavily leveraged across many industries
Government
Compliance: Score to detect non-compliant behavior and tax evasion
Social Services: Assess likelihood that individual will need multiple agency support to proactively engage various agencies to create best outcome and manage costs
Banking
Card: Use scoring to determine transaction risk based on spending history
Money laundering risk: Based on money wiring to multiple accounts keeping amount below threshold
Retail
Sales opportunity: Real-time scoring for target marketing
Running Analytics on enterprise data off-platform doesn’t pay for a mainframe-centric business…
A large Asian bank:
One mainframe devoted exclusively to bulk data transfers
ETL consuming 8% of total distributed cores and 18% of total MIPS
A large European bank:
120 database images created from bulk data transfers
1,000 applications on 750 cores with 14,000 software titles
ETL consuming 28% of total distributed cores and 16% of total MIPS
[Figure: Operational Data copied out to five separate Analytical Data stores]
With this strategy, IT costs grow faster than business growth
Source: IBM Eagle Studies
… Rather it leads to significant data transfer costs
Example: 1 TB of data transferred per day
[Figure: Operational Data moved through an ODS to Analytical Data stores on x86]
Assuming 4 cores on z13 running at 85% utilization and 12 cores on x86 servers running at 45% utilization, the transfer will burn 519 MIPS and use 10 x86 cores per day
Estimated 4 yr. cost summary
System costs = $9,864,412
Labor costs = $393,927
Total = $10,258,339
Source: IBM CPO internal study
ETL Calculator: http://centerlinebeta.net/5023d/
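The 4-year cost summary above can be checked with simple arithmetic. The sketch below uses only the three dollar figures from the slide; the per-TB number is a derived illustration that assumes 1 TB is moved on each of 4 x 365 days, which the slide does not state explicitly.

```python
# Figures copied from the slide's estimated 4-year cost summary.
SYSTEM_COSTS = 9_864_412
LABOR_COSTS = 393_927
STATED_TOTAL = 10_258_339

# Assumption (not on the slide): 1 TB is transferred on each of 4 x 365 days.
DAYS = 4 * 365
TB_PER_DAY = 1


def total_cost(system: int, labor: int) -> int:
    """Sum system and labor costs."""
    return system + labor


def cost_per_tb(total: int, days: int = DAYS, tb_per_day: int = TB_PER_DAY) -> float:
    """Spread the total over every terabyte moved in the period."""
    return total / (days * tb_per_day)


if __name__ == "__main__":
    total = total_cost(SYSTEM_COSTS, LABOR_COSTS)
    print(f"total = ${total:,}")
    print(f"approx. cost per TB moved = ${cost_per_tb(total):,.0f}")
```

Under that assumption, the slide's total works out to roughly seven thousand dollars per terabyte moved.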
Apache Spark powers a federated Analytics architecture keeping data in place
[Figure: Spark instances running on z/OS (alongside CICS, WAS, IMS, and DB2), Linux on z Systems, Power, x86, and Cloud. z/OS data sources shown: DB2 z/OS, VSAM, IMS DB, Physical Sequential, SMF, Log Streams, ADABAS, Syslog, Tape]
Unified analytics platform for structured and unstructured data
Flexibility & agility with multi-language support
Efficient structure: 100x vs. MapReduce
Rich set of built-in functions with consistent APIs: Spark SQL, Spark MLlib, Spark Streaming, GraphX, …
Apache Spark empowers users to accelerate the insight economy
Data Scientist
Data Engineer
App Developer
“the convincer”
“the builder”
“the thinker”
What they want to do (Data Scientist):
Identify patterns, trends, risks, and opportunities in data
Discover new actionable insights
How Spark can help:
Supports the entire data science workflow: from data access and integration, to machine learning, to visualization
Provides a growing library of machine learning algorithms
What they want to do (Data Engineer):
Bridge between the Data Scientist and the App Developer
Implement machine learning algorithms at scale
Put the right data system to work for the job at hand
How Spark can help:
Abstracts data access complexity (Spark doesn’t care what your data store is)
Enables solutions at web-scale
What they want to do (App Developer):
Build applications that leverage advanced analytics in partnership with the data scientist and data engineer
Follow agile design methodologies
Optimize performance and meet SLAs
How Spark can help:
Supports the top analytics app languages such as Python and Scala
Eliminates programming complexity with libraries such as MLlib and simplifies DevOps
Makes it easy to embed advanced analytics into applications
Apache Spark is an operating system for Analytics
Spark Core: Foundation providing task dispatching, scheduling, I/O. Representation of Spark’s basic unit of data: RDD
Spark SQL: Provides capability to perform relational queries via SQL (subset of HiveQL). Mix SQL queries with Spark applications
Spark Streaming: Enables scalable, high-throughput processing of live data streams. Live stream ‘chopped’ into batches based on time window
Spark MLlib: Provides scalable machine learning library, has common machine learning functions. Provides classification, regression, clustering, filtering, etc.
Spark GraphX: Spark APIs for graph style processing and iterative graph computations
from http://spark.apache.org
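The Spark Streaming description says a live stream is ‘chopped’ into batches based on a time window. A minimal plain-Python sketch of that idea follows. It does not use the Spark API; the function name `chop_into_batches` and the sample events are hypothetical and only illustrate the grouping step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# An event is (timestamp in seconds, payload). Hypothetical shape for illustration.
Event = Tuple[float, str]


def chop_into_batches(events: Iterable[Event], window_seconds: float) -> Dict[int, List[str]]:
    """Group events into fixed-size time windows, keyed by window index."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # Window 0 covers [0, window), window 1 covers [window, 2 * window), ...
        batches[int(timestamp // window_seconds)].append(payload)
    return dict(sorted(batches.items()))


if __name__ == "__main__":
    stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
    for window, payloads in chop_into_batches(stream, window_seconds=1.0).items():
        print(window, payloads)
```

Each resulting batch can then be processed like a small static dataset, which is the property the slide attributes to the time-window approach.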
Spark on z/OS joins multiple data types for fast, complete analytics, without moving the data
1 Scala query using SQL syntax to access 3 data sources (over 1 billion rows of data) in 3 different formats
1.1 billion rows of source data
Filtered to 50M rows of Trades for brokers in 1 region
Summarized to show activity for each broker across the big 3 exchanges
Completed in < 2 minutes with 1 GP (60% utilized) and 11 zIIPs on a z13 LPAR with 512 GB memory
[Figure: NYSE, Nasdaq, and S&P data held on z/OS in DB2 z/OS, a flat file, and VSAM, each accessed via JDBC/AZK]
Use Case: Filtered data pull from 1 billion rows, using Spark filtering to access 50 million rows, and then summarizing using Spark aggregation
Resource specs for run: 6 executors, 6 GB driver memory, 80 GB per executor, 4 vCores per executor (zIIPs in SMT-2), 12 concurrent threads, 512 GB total memory in LPAR, 13 zIIPs, 2 GPs. AZK accessing all data, Spark unionAll to merge data. 55% of 1 GP utilized during peak resource consumption
On-platform Operational Analytics using Spark on z/OS
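The shape of the query described above (merge three sources, filter to one region, summarize per broker and exchange) can be sketched with SQL. The example below is not the Scala query from the study and does not use Spark or AZK. It uses Python's built-in `sqlite3` as a stand-in, and the table names, column names, and sample rows are all hypothetical.

```python
import sqlite3
from typing import List, Tuple


def demo_connection() -> sqlite3.Connection:
    """Build three tiny in-memory tables standing in for the three data sources."""
    conn = sqlite3.connect(":memory:")
    for table in ("nyse_trades", "nasdaq_trades", "sp_trades"):
        conn.execute(f"CREATE TABLE {table} (broker TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO nyse_trades VALUES (?, ?, ?)",
                     [("b1", "east", 100.0), ("b2", "west", 50.0)])
    conn.executemany("INSERT INTO nasdaq_trades VALUES (?, ?, ?)",
                     [("b1", "east", 25.0), ("b1", "east", 75.0)])
    conn.executemany("INSERT INTO sp_trades VALUES (?, ?, ?)",
                     [("b3", "east", 10.0)])
    return conn


def broker_activity(conn: sqlite3.Connection, region: str) -> List[Tuple]:
    """Union the three sources, keep one region, and aggregate per broker and exchange."""
    query = """
        SELECT broker, exchange, COUNT(*) AS trades, SUM(amount) AS total
        FROM (
            SELECT 'NYSE' AS exchange, broker, region, amount FROM nyse_trades
            UNION ALL
            SELECT 'NASDAQ', broker, region, amount FROM nasdaq_trades
            UNION ALL
            SELECT 'SP', broker, region, amount FROM sp_trades
        )
        WHERE region = ?
        GROUP BY broker, exchange
        ORDER BY broker, exchange
    """
    return conn.execute(query, (region,)).fetchall()


if __name__ == "__main__":
    for row in broker_activity(demo_connection(), "east"):
        print(row)
```

`UNION ALL` plays the role the slide assigns to Spark's unionAll, while the `WHERE` and `GROUP BY` clauses correspond to the filtering and aggregation steps.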
On-platform Operational Analytics achieves 67% lower TCA
Brokerage aggregation query workload across Trades tables from 3 exchanges (over 5 billion trades, 500GB)
[Figure: two configurations compared. Off-platform: z/OS with CICS and DB2, ETL to Linux running Apache Spark on Parquet. On-platform: z/OS with CICS, DB2, and Apache Spark. Trade table label: 166GB]
z13-606 + 11 zIIPs
z13-605
Competitor x86 System: Intel E5-2697 v2 2.7GHz 12co
67% lower TCA* for systems compared
$2,105,990 (3 yr. TCA)
$697,106 (3 yr. TCA)
* 3-Year TCA includes 3-year US prices for Hardware, Software, Maintenance and Support as of 05/16/2016. Price and performance for x86 environment includes cost of ETL and elapsed time to transfer the data. This is based on an IBM internal study designed to replicate a typical IBM customer workload usage in the marketplace.
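The 67% figure can be reproduced from the two 3-year TCA amounts on the slide. In the sketch below, the assignment of the higher amount to the x86 plus ETL configuration and the lower amount to the on-platform configuration is an inference from the slide title, not something the extracted text states directly.

```python
# 3-year TCA figures from the slide.
# Assumption: the higher figure is the off-platform (x86 + ETL) configuration,
# the lower figure is the on-platform (Spark on z/OS) configuration.
OFF_PLATFORM_TCA = 2_105_990
ON_PLATFORM_TCA = 697_106


def percent_lower(baseline: float, candidate: float) -> float:
    """Return how much lower candidate is than baseline, as a percentage of baseline."""
    return (baseline - candidate) / baseline * 100


if __name__ == "__main__":
    saving = percent_lower(OFF_PLATFORM_TCA, ON_PLATFORM_TCA)
    print(f"on-platform TCA is about {saving:.1f}% lower")
```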
It doesn’t pay to move data to x86 to run Analytics beyond 150GB
[Chart: Comparing 3-year TCA factoring Cost of ETL. Series: Spark on z/OS, Spark on x86. X-axis: data size from 100GB to 900GB. Y-axis: $0 to $4,000,000]
What is in the Offering?
IBM z/OS Platform for Apache Spark (IBM product):
• Apache Spark enabled for z/OS
• Optimized Data Integration Layer
• No License Charge product
• Support & Service available from IBM
–Very aggressive pricing for zIIPs and memory for Spark z/OS workload
–Lab Services – install, config, tune
–Jumpstart service for data science use cases
–DataFactZ – quick strike POCs for Spark analytic business applications
Ecosystem
–GitHub z/OS-Spark repository
•Jupyter Notebook IDEs (Scala Workbench, Interactive Insights Workbench)
•Apache Job Server
•Sample data & code snippets
–Rocket:
•Industry vertical mappings, e.g. ISO8583-1 for card data
•In progress: “R” support
–DataFactZ:
•Custom Solutions for banking & insurance
–Zementis:
•Fast, Scalable, In-Transaction Predictive Scoring integration with Apache Spark
Spark on z/OS offering is now generally available
Common use cases of Spark on z/OS
Analytics across OLTP & Warehouse information
Analytics combining business-owned data and external IoT and social data
Analytics to improve system performance and operations in real-time using streaming as well as archived data
Why Spark on z/OS is best fit
Optimized and parallel access to almost all z/OS data environments and distributed data sources analyzing data in place
Spark memory structures with sensitive data are governed with z/OS security capabilities
Leverages z/OS memory management, compression, and RDMA communications to provide a high-performance scale up and scale out architecture.
Uses large pages, incorporating DRAM with large amounts of Flash as an attractive means to provide scalable elastic memory
Best fit analytic capability for the investments made in SMF in-memory analytics
Leverages zEDC compression when compressing internal data for caching and shuffling
SMT2 and SIMD on select operations for enhanced performance
Very high zIIP offload for affordability
Intra-SQL and intra-partition parallelism for optimal data access
Get started today: ibm.biz/zAnalytics
Seize the moment with Analytics on z Systems: adopt or be disrupted
Enhance customer experience by providing the right product at the right place and right time
Locate analytics exactly where your data resides
Power real-time analytics with the in-memory analytics engine Spark on z/OS
IBM z Systems CPO Resources
IBM Competitive Project Office
Our mission is to perform hands-on competitive research that will enable sales and technical professionals to boost market awareness, increase pipeline and sales of IBM z Systems hardware and IBM software. Please visit our website for additional information on how we can help you generate leads and close business.
Customer Briefing
Bringing to life the results of IBM's lab-based, hands-on competitive research with your clients in face-to-face events.
"IBM z Systems – More Potential than Ever"
"IBM LinuxONE - Enabling IT to Solve Business Challenges" For First in Enterprise (FIE)
IBM CPO Competitive Sales Assists (CSA)
A CPO CSA helps to make the advantages of IBM's solutions more real and compelling to a customer evaluating competitive alternatives. Visit our website to learn how CPO provides one-to-one client engagements to assist sellers with closing competitive opportunities.
IBM CPO Communication
• Share the external z Systems website with your clients.
• Click here to subscribe to IBM CPO Communications
• IBM Redbooks® Point-of-View and Redpaper publications – Authored by IBM CPO
• Read more about our research - CPO Technical Deliverables
• Join the CPO Community
For information on z Systems CPO Program or Events: Sally Touscany – [email protected]