What is Spark?
• An Apache Foundation open source project; not a product
• An in-memory compute engine that works with data; not a data store
• Enables highly iterative analysis on large volumes of data at scale
• Unified environment for data scientists, developers and data engineers
• Radically simplifies the process of developing intelligent apps fueled by data
© 2015 IBM Corporation
Imagine the possibilities…
Fraud andrisk detection
Real-time traffic flow optimization
Accurate and timely threat detection
Understand and act on
customer sentiment
Low-latency network analysis
Predict and act on intent to purchase
© 2015 IBM Corporation
•© 2015 IBM Corporation4
Fast Leverages aggressively cached in-memory
distributed computing and JVM threads Faster than MapReduce for some workloads
Ease of use (for programmers) Written in Scala, an object-oriented, functional
programming language Scala, Python and Java APIs Scala and Python interactive shells Runs on Hadoop, Mesos, standalone or cloud
General purpose Covers a wide range of workloads Provides SQL, streaming and complex analytics
from http://spark.apache.org
Logistic regression in Hadoop and Spark
Apache Spark is…
IBM’s Commitment to Spark:
Founding Member of AMPLab
Establish Spark Technology Center
Contributing to the core
Open Source SystemML
Educate one million data professionals
•© 2015 IBM Corporation6
Brief History of Spark2002 – MapReduce @ Google
2004 – MapReduce paper
2006 – Hadoop @ Yahoo
2008 – Hadoop Summit2010 – Spark paper
2011 – Hadoop 1.0 GA
2014 – Apache Spark top-level
2014 – 1.2.0 release in December
2015 – 1.3.0 release in March
2015 – 1.4.0 release in June
2015 – 1.5.0 release in September
Activity for 6 months in 2014(from Matei Zaharia – 2014 Spark Summit)
Spark is Active
Most active project in Apache Software Foundation
One of top 3 most active Apache projects
Databricks founded by the creators of Spark from UC Berkeley’s AMPLab
Spark empowers users to accelerate the insight economy
Data Scientist
Data EngineerApp Developer
“the convincer”
“the builder”“the thinker”
What they want to do: Identify patterns, trends, risks, and
opportunities in data Discover new actionable insights
How Spark can help: Supports the entire data science workflow:
from data access and integration, to machine learning, to visualization
Provides a growing library of machine learning algorithms
What they want to do: Bridge between the Data Scientist and the App Developer Implement machine learning algorithms at scale Put the right data system to work for the job at hand
How Spark can help: Abstract data access complexity (Spark doesn’t care what your
data store is) Enables solutions at web-scale
What they want to do: Build applications that lever advanced
analytics in partnership with the data scientist and data engineer
Follow agile design methodologies Optimize performance and meet
SLAs
How Spark can help: Supports the top analytics app
languages such as Python and Scala Eliminates programming complexity
with libraries such as MLlib and simplifies DevOps
Makes it easy to embed advanced analytics into applications
•© 2015 IBM Corporation8
Spark’s basic unit of data
Immutable, fault tolerant collection of elements that can be operated on in parallel across a cluster
Fault tolerance– If data in memory is lost it will be recreated from lineage
Caching, persistence (memory, spilling, disk) and check-pointing
Many database or file type can be supported
An RDD is physically distributed across the cluster, but manipulated as one logical entity:
– Spark will “distribute” any required processing to all partitions where the RDD exists and perform necessary redistributions and aggregations as well.
– Example: Consider a distributed RDD “Names” made of names
MichaelJacquesDirk
CindyDanSusan
DirkFrankJacques
Partition 1 Partition 2 Partition 3
Names
Resilient Distributed Datasets (RDDs)
•© 2015 IBM Corporation9
Spark SQL– Provide for relational queries expressed in SQL, HiveQL and
Scala– Seamlessly mix SQL queries with Spark programs – Provide a single interface for efficiently working with
structured data including Apache Hive, Parquet and JSON files
– Standard connectivity through JDBC/ODBC– Rocket Software has announced intent to make
available a Rocket Mainframe Data Service for Apache Spark z/OS
Spark R– Spark R is an R package that provides a light-weight front-
end to use Apache Spark from R – Spark R exposes the Spark API through the RDD class and
allows users to interactively run jobs from the R shell on a cluster.
– Goal to make Spark R production ready – Rocket Software has announced intent to
support R on z/OS, already available for Linux on z
Common, popular methods to access data
© 2014 IBM Corporation
Spark Streaming
Run a streaming computation as a series of very small, deterministic batch jobs
Chop up live stream into batches of X seconds
Spark treats each batch of data as RDDs and processes them using RDD operations
The process results of the RDD operations are returned in batches
Combine live data streams with historical dataGenerate historical data models with SparkUse data models to process live data
Combine Streaming with MLlib algorithmsOffline learning, online predictions Actionable information
© 2014 IBM Corporation
Spark and Hadoop
– Spark analytics *can* use information from Hadoop based file systems – though not required
– Hadoop’s MapReduce processing is disk-dependent, Spark relies on in-memory capabilities; this can result in more ‘real-time’ capabilities for Spark
– Spark comes with a set of libraries and functions for streaming, SQL, graph, and machine learning; Hadoop has an ecosystem of other tools to support these (Storm, Hive, Giraph, Mahout) which are integrated separately
– Spark can use HDFS, but is not limited to particular data store format
Spark analytics
Spark analytics
Spark in-memory Spark Resilient Distributed Data (RDDs)Spark in-memory Spark Resilient Distributed Data (RDDs)
Hadoop Distributed File Systems (HDFS) Hadoop Distributed File Systems (HDFS)
Spark analytics
Spark analytics
Spark analytics
Spark analytics
Other File systems, DBs, etc.Other File systems, DBs, etc.
© 2014 IBM Corporation
Spark Transform actions across multiple data environments: apply a map function to each, filter, sample, union, intersect, joins…
Apply Spark Actions, for example Analytic Algorithms found in MLIB: clustering, PCA, logistic regression, classification, random forest…
Results of analysis
Data: from structured data, unstructured, any data environment supported by Spark
z Systems•DB2, IMS, VSAM•Adabas•Syslog•PDSE •Etc
Distributed:•Oracle•Filesystems•Hadoop •Etc
RDDRDDRDD’RDD’
Apache Spark – Analytics Across Heterogeneous Data Environments
Data did not need to move, centralize, or leave platform enforced security, governance….
•© 2015 IBM Corporation13
Highly technical skills. Limited uses.
Time consuming and complex.
Not focused on insight.
Inflexible and varied
programming models.
Analytics platforms difficult to configure, setup, and
deploy.
Steep learning curve, intense programming.
Not designed for data
scientists.
Separate and fragmented big data and
analytics work flows.
Complex & disjoint
platforms. Mix of workloads.
80% of time spend on data
preparation
Today’s solutions are not for agile data science and app development.Main focus is on data management and predictability.
© 2014 IBM Corporation
IBM z Systems analytics and Spark: well-matched
Faster analytics, both batch and real-timeFederated data access while preserving consistent APIs Historical, OLTP and streaming data Java based, can embed optimized nodes within transaction JVMsUse data in-place, avoiding unnecessary ETLsInfrastructure can be leveraged & optimized ‘underneath’ APIs .. e.g. zEDC, memory
management enhancements, SMT2, SIMDAvailable for Linux on z System now …..Available end of year for z/OS
Apache Spark is emerging as key to analytics ecosystem
File System SchedulerNetwork Serialization
MLlib
RDDcache
SQLGraphXStream
Compression
Look familiar?
© 2014 IBM Corporation
LoZ
Leverage LoZ virtualization benefits
Power
SparkSpark SparkSpark
SparkSpark SparkSpark
x86
Leverage call center, external, social, sentiment data …
DB2DB2 VSAM VSAM
z/OS
Leverage z/OS data and transactions
IBM DB2 Analytics Accelerator
z Systems & Apache Spark: Strategic Direction
IMSIMS
Fast, expressive, cluster computing system leveraging in-memory framework for analytics
Federation across platform Spark implementations will initially need external orchestration
Spark
Spark
Linux on z Systems
Spark
DB2
CICSCICS WASWASIMSIMS
SparkSparkSparkSparkSparkSpark
© 2014 IBM Corporation
Apache Spark on z/OS – Available Year End 2015
Key Business Transaction
Systems
Type 2Type 4
VSAM
Java
Apache Spark Core
Spark Stream
Spark SQL MLib GraphX
RDDRDD
RDDRDD
RDDRDD
RDDRDD
z/OS
Rocket Mainframe Data Service for Apache Spark z/OS
Spark Applications: IBM and Partners
and more . . .
SMFIMSDB2 z/OS
Value:Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks
Consistent analytic interfaces for SQL, Graph, Machine Learning across multiple data and system environments
Integration of analytics across core systems, social data, website information, etc.
NEW! Rocket Software Integrates with Apache Spark z/OS via the Mainframe Data Service for Apache Spark z/OS
Distributed
Oracle
HDFS
© 2014 IBM Corporation
Data: example DB2
Core transactions
z/OS Distributed
SMS gateway
Social Data
Analytics to Qualify
candidate for promotion
offer
CICS Banking Process:
•Process transaction•Score risk of fraud•Qualify & Initiate promotion offer
Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks
Consistent analytic interfaces for SQL, Graph, Machine Learning across multiple data and system environments
Almost entirely all zIIP eligible
Leverage of emerging Spark skills and commercial solution ecosystem built on Apache Spark for fast ROI and agility
Integration of analytics heterogeneous data across core systems, social data, website information, etc.
Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks
Consistent analytic interfaces for SQL, Graph, Machine Learning across multiple data and system environments
Almost entirely all zIIP eligible
Leverage of emerging Spark skills and commercial solution ecosystem built on Apache Spark for fast ROI and agility
Integration of analytics heterogeneous data across core systems, social data, website information, etc.
z/OS
Example: Integration of Spark Analytics with Transaction Systems
Initiate Offer
Federated Analytics, Data in Place
Twitter 27
Spark z/OS
REST or Java interface
Data: example IMS
© 2014 IBM Corporation
Spark z/OS Demo: Business Use Case
18
Financial Institution:offers both retail / consumer banking as
well as investment services
Financial Institution:offers both retail / consumer banking as
well as investment services
Business Critical Information owned by the organization:
Credit Card Info
Trade Transaction Info
Customer Info
Stock Price History – public data
Social Media Data: Twitter
GOALGOAL Produce right-time offers for cross sell or upsell, tailored for individual customers based on: customer profile, trade transaction history, credit card purchases and social media sentiment and information
View DemoView Demo https://youtu.be/sDmWcuO5Rk8
© 2014 IBM Corporation
Demo: Configuration
19
DB2 z/OSDB2 z/OS
IMSIMS
VSAMVSAM
Cloudant Linux on zCloudant
Linux on z
Sp
ark Job
S
erverS
park Jo
b
Server JDBCJDBC
JDBCJDBC
JDBC with
Rocket MDSS
JDBC with
Rocket MDSS
IMS Transaction
system
IMS Transaction
system
Credit Card Info
Open Source Visualization Libraries
Customer Info
Trade Transaction Info
Linux on z z/OS
CICSTransaction
system
CICSTransaction
system
REST API
REST API
© 2014 IBM Corporation
Analytics across OLTP & Warehouse information Leverages aggressively cached in-memory distributed computing and JVM threads
Analytics combining business-owned data and external / social data Written in Scala, an object-oriented, functional
Analytics of real-time transactions via streaming, combine with OLTP & social Covers a wide range of workloads Provides SQL, streaming and complex analytics
Custom analytics of SMF leveraging real-time as well as archived data Covers a wide range of workloads Provides SQL, streaming and complex analytics
Use Cases for Apache Spark z/OS, Leveraging data-in-place analytics:
© 2015 IBM Corporation21
Related Links
Here is Ross Mauri‘s July 27 blog announcement: http://ibm.co/1U2y7Cw
Rocket Software’s Announce:
http://blog.rocketsoftware.com/blog/2015/07/26/get-closer-to-your-data-with-spark/#.Venw5kZ6wo1
For More Information please contact…
Len Santalucia, CTO & Business Development Manager Vicom Infinity, Inc. One Penn Plaza – Suite 2010 New York, NY 10119 917-856-4493 mobile [email protected] About Vicom InfinityAccount Presence Since Late 1990’sIBM Premier Business PartnerReseller of IBM Hardware, Software, and MaintenanceVendor Source for the Last 10 Generations of Mainframes/IBM StorageProfessional and IT Architectural ServicesVicom Family of Companies Also Offer Leasing & Financing, Computer Services, and IT Staffing & IT Project Management