© 2016 IBM Corporation
Concepts in Analytics
Martin Schneider
Offering Leader, DB2 Analytics Accelerator
IBM Germany Research and Development GmbH
Disclaimer
© Copyright IBM Corporation 2015. All rights reserved. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion. Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
IBM, the IBM logo, ibm.com, DB2, and DB2 for z/OS are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Other company, product, or service names may be trademarks or service marks of others.
Gartner CIO Agenda: Analytics extends its number one position among technology topics
Source: Gartner “Flipping to Digital Leadership – Insights from the 2015 Gartner CIO Agenda Report”, gartner.com/cioagenda
A New Relationship with the Customer
Accounts
Commerce
Interaction
Branding
Before: “I have a product and I am looking for a customer for it”
Today: “I have a customer; what does he need most?”
Analytics Decides Between Profit and Loss
Use of IT as a business strategy
§ Annual increase in Customer Lifetime Value for companies with engagement analytics
§ Share of marketers who send the same material to all customers
§ Share of companies that are “highly satisfied” with the provision of information for their work
§ Size of the typical fine when a bank violates regulations
§ Estimated loss through fraud in healthcare
§ Lost tax revenue due to compliance violations
Figures shown on the slide (their assignment to the statements above is not recoverable from the transcript): 80%, 6%, +7.6%, 226 million $, 16%, 100 million $
Accelerating the Client’s Journey to Cognitive Analytics
Natural, Intuitive or Automated Interaction
Context Specific Usage
Opportunities to infuse cognition and collaboration in existing solutions and products for differentiation.
Win on Innovation
Compete on time to business value – through context specific data, methods, workflow.
Reasoning
Learning
Natural Language
Optimization
Rules
Predictive Modeling
Forecasting
Statistical Analysis
Alerts
Drilldown Query
Ad-hoc Reports
Standard Reports
Big Data Platforms
ECM
Information Integration
RDBMS
Cognitive Analytics in the Context of Big Data – Key Drivers
§ The need for cognitive analytics is driven by the confluence of SoLoMo (Social, Local, Mobile), Big Data, and Cloud
Veracity, Variety, Velocity, Volume
Cognitive Systems
What is Apache Spark?
What’s Spark?
Origin
Founding Sponsors: Google, Amazon, SAP, IBM
Sponsors: Adobe, Apple, Bosch, Cisco, Cray, EMC, Ericsson, Facebook, Huawei, Informatica, Intel, Microsoft, NetApp, Pivotal, VMware.
Affiliates: many
2002 – MapReduce @ Google
2006 – Hadoop @ Yahoo
2010 – Spark paper UC Berkeley
2011 – Hadoop 1.0 GA
2014 – Apache Spark top-level project (most active)
Fast
§ In-memory distributed computing and JVM threads
§ Faster than MapReduce for some workloads
Ease of use (for programmers)
§ Written in Scala
§ Scala, Python and Java APIs
§ Scala and Python interactive shells
§ Runs clustered (Mesos,…), standalone or in cloud
General purpose
§ Covers a wide range of workloads
§ Provides a variety of complex analytics libraries
§ SQL, ML, Streaming, Graph
Logistic regression in Hadoop and Spark
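Logistic regression is a common benchmark here because it is iterative: every pass reads the same training data again. The following is a minimal single-machine sketch in plain Python, not Spark code. The data set, the function names, the learning rate and the iteration count are all illustrative assumptions. It only shows the access pattern that benefits from keeping data in memory: `points` is loaded once and reused by every iteration, whereas a chain of MapReduce jobs would typically re-read its input on each pass.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, features):
    """Probability that the label is 1 under the current weights."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)))


def train_logistic_regression(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent; `points` stays in memory across all passes."""
    weights = [0.0] * len(points[0][0])
    for _ in range(iterations):
        gradient = [0.0] * len(weights)
        for features, label in points:          # one full pass over the cached data
            error = predict(weights, features) - label
            for j, x in enumerate(features):
                gradient[j] += error * x
        weights = [w - learning_rate * g / len(points)
                   for w, g in zip(weights, gradient)]
    return weights


def make_points(n=200, seed=42):
    """Hypothetical toy data: label is 1 when x > 0; first feature is a bias term."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        points.append(([1.0, x], 1 if x > 0 else 0))
    return points


points = make_points()                           # loaded once, reused 200 times
weights = train_logistic_regression(points)
accuracy = sum((predict(weights, f) > 0.5) == (y == 1)
               for f, y in points) / len(points)
print(weights, accuracy)
```

In Spark the same idea would usually be expressed by caching the training data set before the loop; the sketch above deliberately avoids any Spark API so that it runs without a cluster.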
Apache Spark
Languages: Java / Python / Scala / R
Spark Libraries: Spark SQL (Relational Operators), Spark MLlib (Machine Learning), Spark GraphX (Graph Processing), Spark Streaming (Real-Time Streaming)
Spark Core: General Execution Engine
Cluster Manager: YARN, Mesos, Standalone
Data Abstraction: HDFS / Cassandra / HBase / Parquet / ...
Apache Spark z/OS
Available since Year End 2015 via Open Source
Securely Integrate OLTP and Business Critical Data
Integrate:
• DB2 for z/OS, IMS, VSAM, PDSE, Syslog, SMF, ...
• Remote (non-z) data on distributed servers, Hadoop, Oracle, ...
Federated Analytics, Data in Place
Example: Integration of Spark Analytics with Transaction Systems
Key Values:
• Optimized access and z/OS governed ‘in-memory’ capabilities for core business data, leveraging open source analytic frameworks
• Consistent analytic interfaces for SQL, Graph, Machine Learning across multiple data and system environments
• Leverage of emerging Spark skills and commercial solution ecosystem built on Apache Spark for fast ROI and agility
• Integration of analytics across core systems, social data, website information, etc.
[Diagram labels: Core data, Core transactions, z/OS, Linux on Z, SMS gateway, Sentiment Analysis, Spark nodes, Apache Spark SQL, Qualify candidate for promotion offer, Initiate Offer]
CICS Banking Process:
• Process transaction
• Score risk of fraud
• Qualify & initiate promotion offer
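The three steps of the banking process (process the transaction, score the risk of fraud, qualify and initiate a promotion offer) can be sketched as a small decision flow. This is a toy illustration in plain Python, not CICS or Spark code. The `Transaction` fields, the scoring rules, the thresholds and the sentiment input are all hypothetical; a real deployment would presumably call a trained model rather than fixed rules.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    customer_id: str
    amount: float
    country: str        # where the transaction takes place
    home_country: str   # where the customer normally transacts


def score_fraud_risk(tx):
    """Toy rule-based score in [0, 1]; stands in for a trained fraud model."""
    score = 0.0
    if tx.amount > 5000:
        score += 0.5
    if tx.country != tx.home_country:
        score += 0.4
    return min(score, 1.0)


def qualifies_for_offer(tx, sentiment, risk):
    """Hypothetical rule: low risk, positive sentiment, meaningful amount."""
    return risk < 0.3 and sentiment > 0.0 and tx.amount >= 100


def process(tx, sentiment):
    """Run the three steps in order and report what happened."""
    risk = score_fraud_risk(tx)
    if risk >= 0.8:
        # Suspicious transactions are held and never receive an offer.
        return {"status": "held", "risk": risk, "offer": False}
    offer = qualifies_for_offer(tx, sentiment, risk)
    return {"status": "processed", "risk": risk, "offer": offer}


normal = process(Transaction("c1", 250.0, "DE", "DE"), sentiment=0.6)
suspicious = process(Transaction("c2", 9000.0, "US", "DE"), sentiment=0.6)
print(normal)
print(suspicious)
```

The sentiment value stands in for the output of the sentiment analysis component shown in the diagram; how that value is produced is outside the scope of this sketch.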
Spark empowers users to accelerate the insight economy
“the convincer”, “the builder”, “the thinker”
Data Scientist
What they want to do:
§ Identify patterns, trends, risks, and opportunities in data
§ Discover new actionable insights
How Spark can help:
§ Supports the entire data science workflow: from data access and integration, to analysis, to visualization
§ Provides a growing library of machine learning algorithms
Data Engineer
What they want to do:
§ Bridge between the Data Scientist and the App Developer
§ Implement machine learning algorithms at scale
§ Put the right data system to work for the job at hand
How Spark can help:
§ Abstracts data access complexity (Spark doesn’t care what your data store is)
§ Enables solutions at web-scale
App Developer
What they want to do:
§ Build applications that leverage advanced analytics in partnership with the data scientist and data engineer
§ Follow agile design methodologies
§ Optimize performance and meet SLAs
How Spark can help:
§ Supports the top analytics app languages such as Python and Scala
§ Eliminates programming complexity with libraries such as MLlib and simplifies DevOps
§ Makes it easy to embed advanced analytics into applications
Use Case Scenarios
§ Use case and roles based approach to understand entry points and required integration between various components
§ Identified 4 core use case scenarios:
1. SQL on “Open Source” data stores and data in DB2 for z/OS
2. Using “Open Source” for information integration purposes
3. Leveraging “Open Source” for data exploration, data mining, ML
4. Performing SystemT & NLP type analytics on non-structured data
§ Use cases require different integration points and leverage different Spark capabilities, e.g.
- Spark SQL
- Spark MLlib
What is a Data Lake?
IBM’s Data Lake Terminology
A group of repositories that are managed, governed, protected, and connected by metadata, and that provide self-service access.
The most important and distinguishing element of a Data Lake is governance.
A Data Lake is not an Enterprise Data Warehouse or Ad Hoc Data Mart(s).
It is a methodology to build analytical repositories over ALL data in a manner which:
• documents their contents (for re-use),
• provides data lineage back to source systems, and
• allows using the best tool for the job (repository, engine, UI, etc.)
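The two catalog duties named on this slide, documenting contents and providing lineage back to source systems, can be illustrated with a very small in-memory catalog. This is a sketch under stated assumptions and not an IBM product API: the `CatalogEntry` and `Catalog` classes, their fields and the example data set names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    repository: str                 # where the data physically lives
    description: str                # documents the contents for re-use
    derived_from: list = field(default_factory=list)  # names of parent entries


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add an entry; every lineage parent must already be documented."""
        for parent in entry.derived_from:
            if parent not in self._entries:
                raise KeyError(f"unknown lineage parent: {parent}")
        self._entries[entry.name] = entry

    def lineage(self, name):
        """Return the source data sets that `name` ultimately derives from."""
        entry = self._entries[name]
        if not entry.derived_from:
            return [name]           # a source system data set
        sources = []
        for parent in entry.derived_from:
            for source in self.lineage(parent):
                if source not in sources:
                    sources.append(source)
        return sources


catalog = Catalog()
catalog.register(CatalogEntry("accounts", "DB2 for z/OS", "Core account data"))
catalog.register(CatalogEntry("web_clicks", "Hadoop", "Raw website click log"))
catalog.register(CatalogEntry("customer_360", "Hadoop", "Joined customer view",
                              derived_from=["accounts", "web_clicks"]))
catalog.register(CatalogEntry("churn_features", "Sandbox", "Model input table",
                              derived_from=["customer_360"]))
print(catalog.lineage("churn_features"))
```

Rejecting entries whose parents are unknown is one simple way to keep undocumented data out of the lake; it is a design choice of this sketch, not something the slide prescribes.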
This...
...can easily become this
What started as a noble concept, the Data Lake, has resulted in Data Swamps
IBM’s Data Lake = Efficient Management, Governance, Protection and Access.
Data Lake
Information Management and Governance Fabric
Data Lake Services
Data Lake Repositories
IBM’s Data Lake
The Data Lake Sub-Systems
Data Lake (System of Insight)
Information Management and Governance Fabric
Catalog
Self-Service Access
Enterprise IT Data Exchange
Self-Service Access
Data Lake Repositories
The Data Lake Users Supported
Data Lake (System of Insight)
Information Management and Governance Fabric
Catalog
Self-Service Access
Enterprise IT Data Exchange
Self-Service Access
Data Lake Repositories
Analytics Teams
Governance, Risk and Compliance Team
Information Curator
Line of Business Teams
Data Lake Operations
Enterprise IT
Other Data Lakes
Systems of Engagement
Systems of Automation
Systems of Record
New Sources
Data Lake Deployment Patterns
1. DB2 for z/OS as a source
2. DB2 for z/OS as a data platform for the Data Lake
3. DB2 for z/OS as a consumer of insight
4. DB2 for z/OS as a downstream system
[Four diagrams, one per pattern. Each shows the Data Lake (Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories) together with DB2 for z/OS (Operational) and the Accelerator (Analytical). The connecting flows shown are Data In, direct use by Analytics Teams, Information Service Calls, and Data Out.]
Pattern 1: DB2 for z/OS as a Source
[Diagram labels: Enterprise IT, DB2 for z/OS (Operational), Accelerator (Analytical), Data In, Data Out, Enterprise IT Data Exchange, Data Lake, Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories, Analytics Teams]
Deployment:
§ DB2 for z/OS data copied regularly into the Data Lake
§ Analytic models built in Data Lake Operational History Repositories
§ Analytics discovery, exploration and modeling conducted on the Data Lake data platform
§ New analytical models deployed in the Accelerator for use by z Systems applications
Pattern 2: DB2 for z/OS as a Data Lake Platform
[Diagram labels: Data Lake, Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories, Accelerator (Analytical), DB2 for z/OS (Operational), Analytics Teams]
Deployment:
§ DB2 for z/OS data defined as one of the Data Lake Data Platforms through schemas mapped to the Data Lake’s catalog
§ Mapped schemas in the Data Lake catalog selected for discovery, exploration and modeling of data in sandboxes
§ Data access of z Systems data enabled to pull data directly from DB2 z/OS or Accelerator into sandboxes
§ DB2 z/OS and Accelerator in this deployment logically sit inside the data lake and provide analytics to Analytics teams, e.g. Data Scientists.
Pattern 3: DB2 for z/OS as a Consumer
[Diagram labels: Data Lake, Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories, Enterprise IT Data Exchange, Enterprise IT, Accelerator (Analytical), DB2 for z/OS (Operational), Information Service Calls]
Deployment:
§ Supported Data Lake APIs or stored procedures called from z Systems applications to access additional data and insight generated by analytics running in the data lake
§ Requirement: the Data Lake must support the availability requirements of the z Systems platform
Pattern 4: DB2 for z/OS as a Downstream System
[Diagram labels: Data Lake, Information Management and Governance Fabric, Data Lake Services, Data Lake Repositories, Enterprise IT Data Exchange, Enterprise IT, Accelerator (Analytical), DB2 for z/OS (Operational), Data Out]
Deployment:
§ Insights from the Data Lake and selected supporting data fed to DB2 z/OS or Accelerator
§ Operation of z Systems and data lake decoupled in this deployment
§ z Systems analytical insights conducted locally; however, there may be a delay between insights being generated in the Data Lake and being published in the Accelerator
Business Scenarios we see
Subject matter experts want access to their organization’s data to explore the content, select, control, annotate and access information using their terminology with an underpinning of protection and governance.
Data Scientists seeking data for new analytics models
Marketeer seeking data for new campaigns
Fraud investigator seeking data to understand the details of suspicious activity
• Day-to-day activity
• Requiring ad hoc access to a wide variety of data sources
• Supporting analysis and decision making
• Using the subject matter experts’ terminology
Providing the flexibility of spreadsheets that can scale to large volumes and a wide variety of information types, whilst protecting sensitive information and optimizing data storage and provisioning.
System z Hybrid Transaction/Analytical Processing
The hybrid computing platform on z Systems
Supports transaction processing and analytics workloads concurrently, efficiently and cost-effectively
Delivers industry-leading performance for mixed workloads
The unique heterogeneous scale-out platform in the industry
Superior availability, reliability and security
Cloud and Mobile Enabled
[Diagram labels: Transaction Processing, Analytics Workload]
IBM’s DB2 for z/OS and DB2 Analytics Accelerator
A self-managing, hybrid workload-optimized database management system that runs every query workload in the most efficient way, so that each query is executed in its optimal environment for greatest performance and cost efficiency
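The idea that each query runs in the environment that suits it best can be pictured as a routing decision. The sketch below is purely illustrative plain Python. The `QueryProfile` fields, the threshold and the routing rule are hypothetical assumptions and do not describe how DB2 for z/OS or the Accelerator actually decide; they only show the shape of such a decision.

```python
from dataclasses import dataclass


@dataclass
class QueryProfile:
    estimated_rows_scanned: int
    uses_aggregation: bool
    is_update: bool


def choose_engine(query, scan_threshold=1_000_000):
    """Pick an engine name for a query, using made-up criteria.

    Updates and short lookups stay on the transactional engine;
    large scans and aggregations go to the analytical engine.
    """
    if query.is_update:
        return "transactional"
    if query.uses_aggregation or query.estimated_rows_scanned >= scan_threshold:
        return "analytical"
    return "transactional"


lookup = QueryProfile(estimated_rows_scanned=1, uses_aggregation=False, is_update=False)
report = QueryProfile(estimated_rows_scanned=50_000_000, uses_aggregation=True, is_update=False)
payment = QueryProfile(estimated_rows_scanned=1, uses_aggregation=False, is_update=True)

for name, q in [("lookup", lookup), ("report", report), ("payment", payment)]:
    print(name, "->", choose_engine(q))
```

The point of the sketch is that the application issues one kind of request and the routing stays invisible to it, which is what "self-managing" suggests on the slide.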
z Systems within the Data Lake
[Diagram labels: Enterprise IT; Systems of Record: ATM, Loan, Deposit, …; Systems of Insight: Reporting, Analytics …; Systems of Engagement: Collaboration Systems and Portals, e-Mail, Mobile; z Systems of Record; Data; Real Time Alerts; Reporting; Real-Time Predictive Scoring; optimization; Key Decisions, Constraints, Goals?]
DB2 Analytics Accelerator enables System of Insight Analytics for:
Reporting
Operational Analytics
Quick Model Refresh
Why z Systems?
§ Minimizes or eliminates data movement to other platforms for reporting.
§ Keeps secured data where it originates
§ Significantly reduces data latency times.
§ Exploits z Systems value proposition (RAS).
§ Exploits z13 optimization
§ Unlimited Scalability
Supports transactional and operational analytic systems on same platform.
The Data Lake: Subsystems with Apache Spark
Data Lake (System of Insight)
Information Management and Governance Fabric
Catalog: Management, Governance, Protection
Self-Service Access
Enterprise IT Data Exchange
Self-Service Access
Analytics Teams: Analytics, DWH, …
Data Governance, Risk and Compliance Team
Information Curator
LoB Teams: Risk Modeling, Fraud Mgmt, …
Data Lake Operations
Enterprise IT
Other Data Lakes
Systems of Engagement: CC, e-Mail, Touchpoints, Notes, …
Systems of Automation
Systems of Record: ATM, Loan, Deposit, …
New Sources: Social Media, Twitter, …
Systems of Insight: Reporting, Analytics …
Data Usage
Data Lake Repositories
Hadoop (non-structured)
Other Repositories (e.g. DB2 LUW)
Teradata, Exadata
DB2 for z/OS with Accelerator (IMS, VSAM, …)
The Data Lake: Subsystems with Apache Spark
[Same diagram as on the previous slide, with one change: Self-Service Access is provided through, e.g., Spark]
Imperatives for implementing a successful Data Lake
• Reduce complexity of information supply chain, e.g.
§ Avoid data movement
§ Simplify data transformation
§ Use in-DB transformation
§ Use temporary table structures
• Leverage state-of-the-art technology, e.g.
§ HW accelerators
§ Special-purpose appliances
§ In-memory processing
• Use federation techniques whenever possible, e.g.
§ Federated SQL queries, leaving data in place
§ Federated analytical processing, leaving data in place
• Adhere to innovative and novel BI/DWH concepts, e.g.
§ Limit number of materialized data marts and data cubes
§ Use aggregation on the fly
§ Allow for agile usage patterns
These imperatives align well with the strengths of z Systems
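Two of these imperatives, federated SQL that leaves data in place and aggregation on the fly instead of a materialized data mart, can be tried out on a laptop with Python's built-in `sqlite3` module. This is only an analogy: two attached in-memory SQLite databases stand in for two separate repositories, and the table names and rows are invented for illustration. Real federation across DB2 for z/OS and other stores involves different technology.

```python
import sqlite3

# "main" stands in for the operational store, "lake" for a second repository.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS lake")

conn.execute("CREATE TABLE main.accounts (customer_id TEXT PRIMARY KEY, segment TEXT)")
conn.execute("CREATE TABLE lake.web_events (customer_id TEXT, page_views INTEGER)")

conn.executemany(
    "INSERT INTO main.accounts VALUES (?, ?)",
    [("c1", "retail"), ("c2", "retail"), ("c3", "private")],
)
conn.executemany(
    "INSERT INTO lake.web_events VALUES (?, ?)",
    [("c1", 3), ("c1", 5), ("c2", 2), ("c3", 7)],
)

# One query spans both stores; no table is copied and no data mart is built.
# The aggregate is computed at query time ("on the fly").
rows = conn.execute(
    """
    SELECT a.segment, SUM(e.page_views) AS total_views
    FROM main.accounts AS a
    JOIN lake.web_events AS e ON e.customer_id = a.customer_id
    GROUP BY a.segment
    ORDER BY a.segment
    """
).fetchall()

for segment, total_views in rows:
    print(segment, total_views)
```

Because nothing is materialized, a change in either table is visible to the next run of the query without any refresh job, which is the agility argument behind limiting data marts and cubes.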