
IBM Analytics

IBM Information Server: What is new -- what is next in Data Integration?

November 7th, 2018

Beate Porst

Program Director Offering Management

IBM Unified Governance & Integration

Please note


IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.

The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Accelerating the journey to AI to drive innovation and business success

AI

Data science

Machine learning

Business analytics

Trusted analytics foundation

Hybrid Data Management (Collect): Write once, access anywhere with a common access layer to promote application independence

Unified Governance & Integration (Organize): Prepare, publish, integrate and protect your data to drive insights while mitigating compliance risks

Data Science & Visualization (Analyze): Descriptive, predictive, prescriptive to understand the current, predict the future and change the outcome

Automation through Machine Learning

IBM Analytics Portfolio

Hybrid Cloud Foundation


Our Portfolio: Power behind and across the portfolio

Enable better insight and compliance across all data through Unified Governance & Integration

Automation through Machine Learning

Hybrid Data Management (Collect): Write once, access anywhere with a common access layer to promote application independence

Unified Governance & Integration (Organize): Prepare, publish, integrate and protect your data to drive insights while mitigating compliance risks

Data Science & Visualization (Analyze): Descriptive, predictive, prescriptive to understand the current, predict the future and change the outcome

Hybrid Cloud Foundation

IBM InfoSphere Information Server: Information Empowerment for Your Data Ecosystem

Integrating and transforming data and content to deliver accurate, consistent, timely and complete information through a unified platform with a common metadata foundation

InfoSphere Information Server: Data Quality, Information Governance Catalog, Data Integration

Information Governance Catalog: Understand & Collaborate

− Catalog technical metadata & align w/ business language
− Manage (big) data lineage
− BCBS compliance reporting

Data Quality: Cleanse & Monitor

− Analyze, validate, classify
− Cleanse & standardize
− Define, manage & monitor data rules + exceptions

Data Integration: Transform & Deliver

− Massive scalability
− Power for any complexity
− Deliver in batch and/or real-time with change capture

Common Connectivity / Shared Metadata / Security / Common Execution Engine With Flexible Deployments (Hadoop, Grid, Cloud)

Build once. Address many needs. Accelerate innovation.

Archiving

Records and retention

Audit readiness

Self-service access to data and analytics

Discovery

360-degree information driven insights

Regulations (such as GDPR)

Privacy and protection

EDW optimization

Trusted Analytics Foundation

There is a growing need to provide trusted and business-ready data to consumers across the enterprise

Enterprise Need

Individual Need

Technology Focus

Business Focus

Self-sufficient Builder: Developer, Data Scientist

Self-service Consumer: Business User

IT Builder: IT Department, CIO

Solution Consumer: Line of Business, Chief Data Officer, CXO

9 Reasons Why Information Server Data Integration is Best in Class

Productivity: Rich user interface features simplify the design process and metadata management requirements.

Transformation: Extensive set of pre-built objects that act on data to satisfy both simple & complex data integration tasks.

Connectivity: Native access to common industry databases and applications, exploiting key features of each.

Built-in Governance: Maximizes business & IT collaboration by providing business terms, policies, advanced impact analysis, search, comparison & more.

Performance: Runtime engine providing unlimited scalability through all objects and tasks in batch/real-time, ETL/ELT/DV/SOA.

Operation: Simple management of the operational environment, lending analytics for understanding and investigation.

Administration: Intuitive and robust features for install, maintenance, configuration, security and resiliency.

Integrated Data Quality: Single user experience for data integration as well as designing & running data validation, standardization & matching rules.

Deployment & Runtime Flexibility


Productivity features for Any Data Integration requirement

– Rich, intuitive UI using the same paradigm and logic constructs independent of its deployment:

• Grid

• Cluster

• Hadoop

– Design a job once and execute ANYWHERE

– Reuse paradigm for any data quality & integration task

– Object Analysis based on rich metadata to graphically illustrate data use

– Team Collaboration and Software Lifecycle Support

– Debugging Features support development

Massive scalability needs an MPP shared nothing Architecture

Dynamic: Instantly get better performance as hardware resources are added

Extendable: Add new compute nodes to dynamically scale out

Partitioned: No contention or upper limitation on throughput
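The partitioned, shared-nothing idea above can be illustrated with a small sketch. This is not the Information Server engine: the function names, the hash choice (CRC32) and the sample rows are assumptions made for illustration. Each row is routed to exactly one node by hashing its key, every node works only on its own partition, and adding nodes changes the partition count, not the job logic.

```python
from collections import defaultdict
from zlib import crc32


def partition(rows, key, num_nodes):
    """Assign each row to exactly one node by hashing its key (shared nothing)."""
    nodes = defaultdict(list)
    for row in rows:
        # crc32 is deterministic across runs, unlike Python's salted hash().
        node_id = crc32(str(row[key]).encode("utf-8")) % num_nodes
        nodes[node_id].append(row)
    return nodes


def run_partitioned(rows, key, num_nodes, work):
    """Apply `work` to every partition independently, then merge the results."""
    parts = partition(rows, key, num_nodes)
    results = []
    for node_id in sorted(parts):
        # No partition reads another partition's rows, so there is no contention.
        results.extend(work(parts[node_id]))
    return results


# Hypothetical sample data and transformation.
rows = [{"customer_id": i, "amount": i * 10} for i in range(8)]
double = lambda part: [{**r, "amount": r["amount"] * 2} for r in part]

# The same job definition runs unchanged on 2 or 4 nodes.
two_nodes = run_partitioned(rows, "customer_id", 2, double)
four_nodes = run_partitioned(rows, "customer_id", 4, double)
```

In a real MPP engine the partitions would run in parallel on separate machines; the loop here only shows that the result does not depend on the node count.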

Broader, Faster, Safer: Permanently increasing out of the box Connectivity supported on/off Hadoop

Hadoop

• HBase
• HDFS
• Kafka
• Hive
• Impala
• BigSQL
• Cassandra
• MongoDB
• Presto
• HortonWorks, Cloudera, MapR, EMR

Cloud

• AWS S3
• AWS EMR/Hive
• AWS Redshift
• AWS RDS
• Azure SQL
• Azure Cloud Storage
• IBM Db2 on Cloud
• IBM Cloudant (via REST API)
• Snowflake
• Salesforce
• IBM Cloud Object Storage

Enterprise

• Relational: Oracle, Db2, SQL Server, Informix, Sybase, MySQL, Netezza/PDA/IIAS, TD, Greenplum, Postgres, etc.
• IBM: MDM, ILOG, Streams, MQ, TM1
• Semi-structured: XML, JSON
• File: Flatfile (simple & complex), Excel
• Mainframe: Cobol, Db2z, VSAM, etc.
• APPS: SAP, SFDC, Siebel, Essbase, Peoplesoft, ORA, SAS
• Generic: ODBC, JDBC, REST,...
• Programming: C++, JAVA


Utilize integrated data quality

Eliminate garbage in, garbage out reporting and analytics by implementing comprehensive and scalable data quality processing.

Cleanse: Business-driven Data Standardization & Matching

Assess: M/L assisted automation for Data Classification & Term assignment

Discover: Discovery of Business entities across heterogeneous sources

Monitor & Remediate: Enterprise-wide DQ Exception Monitoring and collaborative remediation

Validate: Rule-based data validation to ensure complete & consistent data

Life Cycle Governance: Ownership and management of Policies & Rules

Built-in Governance and Data Protection

Eliminate an unmanageable data lake by implementing comprehensive data governance, including end-to-end data lineage, for business users.

Data Lineage: Full end-to-end data lineage from EDW to Hadoop

Business Glossary: Establish a business glossary - Accelerate using Industry Models

Masking of Sensitive Data: Define Rules/Policies for handling of sensitive data

Shop for Data: Contextual exploration of assets and key relationships

Classify: Automatic Classification & Term assignment

Discover: Automatic discovery of Metadata assets

Information Server gives you a breadth of deployment options:

• On-premise or cloud
• Ready-made, pre-provisioned or fully customizable

Information Server grows with your data demand:

• You can start small and slowly scale your runtime without needing to change your design.

IBM Cloud, AWS & Azure

• Information Server Hosted offerings for instant cloud provisioning and management
• BYOL: Configure IBM Information Server as you like on any leased Cloud environment

Docker/Kubernetes & IBM Cloud Private

• Docker container deployment on IBM Cloud for instant provisioning and management
• Utilize Information Server capabilities through the fully engineered ICP for Data solution

On Premise

• Install / deploy Information Server on premise and connect to cloud and on premise sources & applications

OnPremise, IaaS, PaaS

Deployment & Runtime Flexibility: Built once -- Run Anywhere

IBM Information Server

11.3 (July 2014): Reducing the Platform Footprint

11.5 (September 2015): Utilizing the Power of Hadoop

11.7 (December 2017): Empowering the user through tailored design and automation

From task/feature oriented to user oriented

Information Server V11.7.x Release Timeline 2018

December 2017:

• General Availability of IBM Information Server V11.7.0.0 for Information Server Packages and Information Server “a-la carte”

March / April 2018:

• V11.7.0.0 in-place upgrade
• Flow Designer Enhancements
• IS on Hadoop Enhancements
• Connectivity enhancements

May / June 2018:

• Pack for SAP Applications v8.1
• V11.7 Feature Pack 1
• DataStage V11.7 Docker Container
• BigIntegrate V11.7 Docker Container

September / October 2018:

• Information Server / DataStage Hourly
• V11.7 Feature Pack 2
• ISEE Docker container

IBM Information Server V11.7 ... moving towards a user centric micro-service based architecture


Strengthen the Data Lake

Increasing speed and resilience on Hadoop

Empower the User

New Self-service / User centered experiences for

Integration and Governance

Hybrid (Cloud) Deployment

More and faster deployment options for

Information Server components

Expanding the Reach

More out of the box connectivity for Cloud, Hadoop & Enterprise

Automation & M/L

Increased automation for the Governance, Integration

and Quality process

DataStage Flow Designer – The New Integration Experience

Empower the User

Intuitive, browser-based (no-install) experience

– Reducing total cost of ownership

Full backwards compatibility

Accelerated productivity through:

– Automatic schema propagation

– Git Source code control Integration

– Highlighted design errors

– Powerful type-ahead search

– Server-side compilation

New since V11.7

DataStage Flow Designer Enhancements


– Smart Palette: Uses M/L to arrange stages in the palette based on usage.

– Transformer: Ability to map input columns to output columns on links.

– New Stages: Amazon S3, Greenplum, Lookup, Peek and Head, Compress

– Automatic Column Propagation: Changes to column metadata, such as a rename, delete, or datatype change, are automatically propagated downstream

– Load Columns: Ability to load columns from table definitions

– Rename Assets: Support rename for connections, table definitions, jobs, links and stages.

– Parameters: Ability to create, edit and delete Job parameters.

– Preview data from relational connectors using a live connection.

– Create, edit and delete connections

– Create, delete projects from DFD and assign users/roles

V11.7 FP1
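Automatic column propagation, as listed above, can be pictured as a walk over the job graph. The sketch below is an illustration only, not Flow Designer's implementation: the data layout (a dict of stage columns and a dict of links) and all stage and column names are hypothetical. A rename made on one stage is pushed to every stage reachable through its output links, while unrelated stages stay untouched.

```python
def propagate_rename(stages, links, start, old, new):
    """Rename column `old` to `new` in `start` and in every stage downstream of it.

    stages: dict mapping stage name -> list of column names
    links:  dict mapping stage name -> list of directly downstream stage names
    """
    seen = set()
    pending = [start]
    while pending:
        name = pending.pop()
        if name in seen:
            continue  # guard against visiting a stage twice
        seen.add(name)
        stages[name] = [new if col == old else col for col in stages[name]]
        pending.extend(links.get(name, []))
    return stages


# Hypothetical three-stage flow plus one unrelated source.
stages = {
    "source": ["id", "cust_nm"],
    "transform": ["id", "cust_nm", "total"],
    "target": ["id", "cust_nm", "total"],
    "other_source": ["cust_nm"],
}
links = {"source": ["transform"], "transform": ["target"]}

propagate_rename(stages, links, "source", "cust_nm", "customer_name")
```

Deletes and datatype changes would follow the same traversal with a different per-stage update.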

DataStage Flow Designer Git Source Code Control Support

– Source code control at the fingertips of the Data Engineer

– Provide synergy between DataStage and Git SCCS

– Maps version of a job in XMETA to a version in Git

– Supports Continuous Engineering for DS artifacts

– Easily rollback to a previous version

– Supports working on multiple versions of a job

V11.7 FP1

M/L infused DataStage Flow Designer: The more you Design, the Smarter it gets

Smart stage suggestion

Smart clustering

Selecting “smartness”

V11.7 FP2
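The slides do not say which model drives the smart palette or the stage suggestions. As a rough intuition only, the sketch below uses plain usage counts: stages are ordered by how often they were used, and the next stage is suggested from what most often followed the current one in earlier designs. The class, its methods and the sample flows are all hypothetical, and simple frequency counting stands in for whatever M/L technique the product actually uses.

```python
from collections import Counter, defaultdict


class StageSuggester:
    """Frequency-based stand-in for usage-driven palette ordering and suggestions."""

    def __init__(self):
        self.usage = Counter()                 # how often each stage type was used
        self.followers = defaultdict(Counter)  # which stage type followed which

    def record_flow(self, stage_types):
        """Learn from one designed flow, given as an ordered list of stage types."""
        self.usage.update(stage_types)
        for current, nxt in zip(stage_types, stage_types[1:]):
            self.followers[current][nxt] += 1

    def palette(self):
        """Stage types ordered by usage, most used first (ties alphabetical)."""
        ranked = sorted(self.usage.items(), key=lambda kv: (-kv[1], kv[0]))
        return [stage for stage, _ in ranked]

    def suggest_next(self, current, k=3):
        """Up to k stage types that most often followed `current`."""
        counts = self.followers.get(current, Counter())
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return [stage for stage, _ in ranked[:k]]


suggester = StageSuggester()
suggester.record_flow(["Db2", "Transformer", "Lookup", "Amazon S3"])
suggester.record_flow(["Db2", "Transformer", "Amazon S3"])
suggester.record_flow(["Greenplum", "Transformer", "Amazon S3"])

palette = suggester.palette()
next_stages = suggester.suggest_next("Transformer")
```

Every recorded flow changes the counts, which is one simple reading of "the more you design, the smarter it gets".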

DataStage Flow Designer Sequence Job Support


– Design Sequence Jobs in DataStage Flow Designer

V11.7 FP2

Strengthen the Data Lake

Faster Deployment on Hadoop

− Deep integration for Ambari & Cloudera Manager
− 10x accelerated deployment time
− Automatically captures all parameters for node deployment
− Instant access to IS activities on Hadoop such as CPU, Mem, Queue status

Improved Preemption Handling

− Accurate job/error handling during container preemption
− Remembering preemption notification during container allocation
− Sending notification with diagnostic to conductor in preemption case
− Checkpointing for automatic restartability

Reducing the Resource Footprint on Hadoop

− Utilizing Hadoop Shuffle space as Information Server Scratch space
− Simple user choice through APT Configuration file setting
− 28% reduction of the actual library size installed on the data nodes (1.8GB to 1.3GB)

Hybrid on/off Hadoop Runtime

− Use a single Instance of Information Server to run Hadoop and non Hadoop workload
− Optimized resource utilization for dedicated workloads against non Hadoop sources/targets
− Simple APT Configuration option

V11.7, FP1, FP2

BigIntegrate / BigQuality: Checkpointing & Automatic Job Restart, Phase 1

• Support automatic restart of jobs from scratch

• Inserts blocking checkpoint before targets

• Checkpoint buffers data in memory and disk until End of Data (EOD)

• If a job fails before EOD, it is automatically restarted

• Enabled through new variables:
  • APT_CHECKPOINT_RESTART – defines the number of restart attempts
  • APT_CHECKPOINT_ENABLED – “targetonly” is the only supported value for Phase 1

Checkpoint

V11.7 FP1 & FP2
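The Phase 1 behavior described above (a blocking checkpoint before the target, and a restart from scratch when the job fails before End of Data) can be sketched as follows. The two variable names come from the slide. Everything else is an assumption for illustration: the function, the way the variables are read, and the in-memory buffer, which stands in for the memory-and-disk buffering the slide describes.

```python
import os


def run_with_checkpoint(read_source, write_target, env=os.environ):
    """Buffer all rows before the target; restart from scratch on failure.

    Returns the number of attempts that were needed.
    """
    # Variable names are from the slide; how the engine parses them is assumed.
    if env.get("APT_CHECKPOINT_ENABLED") != "targetonly":
        write_target(list(read_source()))  # no checkpoint: single attempt
        return 1

    restarts = int(env.get("APT_CHECKPOINT_RESTART", "0"))
    for attempt in range(1, restarts + 2):
        try:
            # Blocking checkpoint: nothing reaches the target until End of Data.
            buffered = list(read_source())
        except Exception:
            if attempt == restarts + 1:
                raise  # restart attempts exhausted
            continue   # restart the whole job from scratch
        write_target(buffered)
        return attempt


# Hypothetical source that fails midway on its first run only.
calls = {"n": 0}


def flaky_source():
    calls["n"] += 1
    for i in range(5):
        if calls["n"] == 1 and i == 3:
            raise RuntimeError("simulated failure before End of Data")
        yield i


target = []
attempts = run_with_checkpoint(
    flaky_source,
    target.extend,
    env={"APT_CHECKPOINT_ENABLED": "targetonly", "APT_CHECKPOINT_RESTART": "2"},
)
```

Because the target is written only after the full input has been buffered, the failed first attempt leaves no partial rows behind, which is what makes a restart from scratch safe.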

Deepening the Ambari Integration


– Added new Ambari Widgets specific to BigIntegrate/BigQuality queue activities:

• Volume of queue including historical data

• Mem utilization

• CPU utilization

– Now includes quick links from Ambari to IS OpsConsole, DataStage Flow Designer and IS Launchpad

• Added a parameter for the IS Service Tier URL in the Ambari Config, which is used to resolve the Quick Links

– Simplified Kerberos deployment through the Ambari Console

V11.7 FP1

Now supported on Cloudera Manager

Hadoop

− New HBase connector

− Hadoop File Connector performance & security enhancements

− Kafka Connector improvements

− Hive Connector enhancements

− MongoDB support

− New Cassandra connector

− BDFS Kerberos improvements for non Hadoop environments

− Cassandra Phase 2 in FP2 -- Data Lineage and metadata import

− Apache Sequential File support for File Connector

− Kafka 0.11 and 1.0 certification

− Cloudera 5.15

− HDP 3.0

Broader, Faster, Safer: Increasing Out of the Box Connectivity

Expanding the Reach

Enterprise

− Oracle PDB and CDB support

− Siebel 8.2.2.4 certification

− Sybase datatype enhancement & IQ 16.1 support

− Security enhancement for metadata import

− New SAP BW & ERP Packs

− Data Masking ODPP v11.3 support and expanded Data masking policy support

− DTS Connector: MQ Client mode

− MQ Connector version update

− ILOG Connector Decision Engine

− Db2 v12 z/OS certification

− Greenplum v5.4 certification

− TD Big Buffer Support and TD 16.2 certification

− New SAP ERP Pack V8.1

− Sybase IQ 16.1

− Db2 connector support for External Tables

− RJUT usability improvements

Cloud

− Amazon EMR/Hive

− Amazon Redshift

− New Snowflake connector

− New Azure Cloud Storage connector

− SFDC API 42 support

− Amazon S3 KMS Support

− Amazon S3 Parquet and ORC support

− New IBM Cloud Object Storage connector

V11.7, FP1 & FP2

Some views on the new IBM Cloud Object Storage Connectivity

– Supported as source and target

– Supports CSV, Parquet, Avro, JSON and Excel files

– Can write & delete files

– Can read single/multiple files using wildcards or regular expressions

V11.7 FP2
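Selecting several files by wildcard or by regular expression, as mentioned above, can be illustrated with standard-library matching. The connector's own pattern syntax is not given in the slides, so the sketch assumes shell-style wildcards and Python regular expressions. The function name and object keys are hypothetical.

```python
import fnmatch
import re


def select_keys(keys, wildcard=None, regex=None):
    """Return the object keys that match a wildcard or a regular expression."""
    if wildcard is not None:
        # Shell-style pattern; note that "*" also matches "/" here.
        return [k for k in keys if fnmatch.fnmatchcase(k, wildcard)]
    if regex is not None:
        pattern = re.compile(regex)
        # fullmatch: the whole key must match, not just a prefix.
        return [k for k in keys if pattern.fullmatch(k)]
    return list(keys)


# Hypothetical bucket listing.
keys = [
    "sales/2018-10.csv",
    "sales/2018-11.csv",
    "sales/2018-11.parquet",
    "hr/staff.json",
]

by_wildcard = select_keys(keys, wildcard="sales/*.csv")
by_regex = select_keys(keys, regex=r"sales/2018-1[01]\.(csv|parquet)")
```

A wildcard is enough for "all CSV files under a prefix". A regular expression is useful when the selection depends on parts of the name, such as a month range or several file formats.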

Information Server Docker Container-based Deployment

• Docker / Kubernetes deployment available for:
  • DataStage
  • BigIntegrate
  • Information Server Enterprise Edition

• Be ready to use an instance in as little as 20 minutes

• Providing you with horizontal scaling and self-healing
  • Instantly restart a container if it fails
  • Expand / shrink your engine tier nodes if you need more/less capacity

V11.7 FP1 & FP2

DataStage BigIntegrate

Enterprise Edition

IBM INFORMATION INTEGRATION

Vision & Strategy Update

Compose

Enable the platform as loosely coupled service for fast & easy deployment

Automate

Infuse data science and machine learning into everything we do

Multi Cloud

Flexible cloud deployment and optimized workload

Simplify

Make products accessible and easily consumable


Development driven by Key Priorities

Unified Governance and Integration Platform: A service-based architecture underpinned by common Metadata & Governance foundation

Core Unified Governance & Data Integration Services: Ref. Data Mgt, Consent Management, Auto Data Classification, Auto Data Quality, Self-Service Data Preparation, Data Integration Collaboration, Auto Term Assign, Policy Management, Policy Enforcement, Open Workflow, Auto Data Discovery, Shop4info, Entity Resolution

API: StoredIQ, Apache Atlas, IBM Cloud Private for Data, Watson Studio

Powered By Information Server

Open Metadata Services

OMRS

Smart Metadata Knowledge Graph, ML & Industry Context Infused: Contextual, Business Metadata, Operational, Technical Metadata, Social, Lineage

IGC

API’s

Data Sources: Systems of Insights, Cloud, Hadoop, Social Media, News, MDM, Documents, Other External, Systems of Engagement, Systems of Record

Personalized User Experience: Data Engineer, Business Users, Data Scientist, Data Quality Analyst, Governance Officers, Data Steward, CDO

PX, Spark

Batch, Real-time, Event-driven

Interactive Personalized Experience

Shape & Curate

Pattern & ML driven flow builder

Comprehensive Flow Design

Open API

Projects

Services

Operations & Administration

Built-in Governance & M/L

Micro-services

User experience adapting to users' needs across the enterprise, NOT the user adapting to the experience

Any user leverages the same enterprise-ready foundation

Adaptable Integration Experiences

Accelerating Data Integration through M/L based automation

Machine Learning assisted Data Integration

• Shop for Data Integration

• M/L supported auto generation of Integration Flows

• Data Movement Strategy

• Heuristic Job Optimization across multi cloud / multi engine

• M/L Job Design Optimization

• Smart User Interface

• M/L based stage suggestions

• M/L based smart palette

• M/L based asset organizer

Multi Cloud Optimization

Customers are operating across environments in multiple clouds, anywhere

• Anywhere ad-hoc service provisioning

• Runtime/Deployment elasticity

• Dynamically expand/shrink capacity based on workload requirements and data location

• Seamless interoperability between IBM's private & public cloud integration services

• Flexible licensing (metered or fixed)

Information Server -- Data Integration Release Plan

Q4 / 2018

Delivery method: Patches

Platform

• Information Server (Docker) on AWS and Azure

Connectivity

• Presto certification (Simba driver) -- Done!
• Azure ADLS (stretch goal)
• DynamoDB certification (Simba driver)
• Google FS
• Google BigQuery
• Hive Update/Delete
• AWS Aurora (Postgres)
• Zookeeper for Hive support (new DD Driver certification)
• SFDC Bulk load with PK chunking

DFD

• Support for Table Definitions Creation
• Information Server on Cloud Managed
• Support for BitBucket in addition to GitHub
• Parameter Set support
• Expose more Job properties
• Show execution metrics
• More stages -- e.g. Java, Hierarchical
• Support for Reject links for Lookup & Connectors

Engine / Hadoop

• IS on Hadoop HDFS Scratch Disk performance improvements
• View of Hadoop log files from a different user id
• BigIntegrate -- resource optimization for lookup and transformer
• DataStage on Spark Technology Preview

Information Server Release Plan 2019

1H:

• Planned new Release for Information Server with focus on M/L and Automation
• MVP Business User driven data preparation & curation
• DataStage runtime on Spark
• Agent Stella
• DataStage / ISEE managed offerings on AWS & Azure

2H:

• Multi-cloud runtime optimization and automatic runtime selection -- Data Movement strategy
• Machine-learning based automatic flow generation (FT 2.0)