IBM Analytics
IBM Information Server: What is new -- what is next in Data Integration?
November 7th, 2018
Beate Porst – porst@us.ibm.com
Program Director Offering Management
IBM Unified Governance & Integration
Please note
2
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice and at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Accelerating the journey to AI to drive innovation and business success
3
AI
Data science
Machine
learning
Business analytics
Trusted analytics foundation
Write once,
access anywhere
with a common access
layer to promote
application independence
Prepare, publish,
integrate and protect
your data to drive
insights while mitigating
compliance risks
Descriptive, predictive,
prescriptive to
understand the current,
predict the future and
change the outcome
Hybrid Data Management
Unified Governance & Integration
Data Science & Visualization
Automation through Machine Learning
Collect Organize Analyze
IBM Analytics Portfolio
Hybrid Cloud Foundation
5
Our Portfolio: Power behind and across the portfolio
Enable better insight and
compliance across all data through
Unified Governance & Integration
Automation through Machine Learning
Write once, access anywhere
with a common access
layer to promote
application independence
Hybrid Data
Management
Collect
Prepare, publish, integrate and protect
your data to drive
insights while mitigating
compliance risks
Unified Governance & Integration
Organize
Descriptive, predictive, prescriptive to
understand the current,
predict the future and
change the outcome
Data Science & Visualization
Analyze
Hybrid Cloud Foundation
IBM InfoSphere Information Server: Information Empowerment for Your Data Ecosystem
6
Integrating and transforming data and content to deliver
accurate, consistent, timely and complete information through
a unified platform with a common metadata foundation
InfoSphere
Information
Server
Data Quality
Information
Governance
Catalog
Data
Integration
Information Governance Catalog
Understand & Collaborate
− Catalog technical metadata & align w/ business language
− Manage (big) data lineage
− BCBS compliance reporting
Data Quality
Cleanse & Monitor
− Analyze, validate, classify
− Cleanse & standardize
− Define, manage & monitor data rules + exceptions
Data Integration
Transform & Deliver
− Massive scalability
− Power for any complexity
− Deliver in batch and/or real-time with change capture
Common Connectivity / Shared Metadata / Security / Common Execution Engine With Flexible Deployments (Hadoop, Grid, Cloud)
Build once. Address many needs. Accelerate innovation.
7
Archiving
Records and retention
Audit readiness
Self-service access to data and analytics
Discovery
360-degree information-driven insights
Regulations (such as GDPR)
Privacy and protection
EDW optimization
Trusted Analytics Foundation
8
There is a growing need to provide trusted and business-ready data to consumers across the enterprise
Enterprise Need
Self-sufficient Builder: Developer, Data Scientist
Self-service Consumer: Business User
IT Builder: IT Department, CIO
Solution Consumer: Line of Business, Chief Data Officer, CXO
Individual Need
Technology Focus vs. Business Focus
9 Reasons Why Information Server Data Integration is Best in Class
9
Productivity: Rich user interface features simplify the design process and metadata management requirements.
Transformation: Extensive set of pre-built objects that act on data to satisfy both simple & complex data integration tasks.
Connectivity: Native access to common industry databases and applications, exploiting key features of each.
Built-in Governance: Maximizes business & IT collaboration, providing business terms, policies, advanced impact analysis, search, comparison & more.
Performance: Runtime engine providing unlimited scalability across all object tasks in batch/real-time, ETL/ELT/DV/SOA.
Operation: Simple management of the operational environment, with analytics for understanding and investigation.
Administration: Intuitive and robust features for install, maintenance, configuration, security and resiliency.
Integrated Data Quality: Single user experience for data integration as well as designing & running data validation, standardization & matching rules.
Deployment & Runtime Flexibility
10
Productivity features for Any Data Integration requirement
– Rich, intuitive UI using the same paradigm and logic constructs independent of its deployment:
• Grid
• Cluster
• Hadoop
– Design a job once and execute ANYWHERE
– Reuse paradigm for any data quality & integration task
– Object Analysis based on rich metadata to graphically illustrate data use
– Team Collaboration and Software Lifecycle Support
– Debugging Features support development
Massive scalability needs an MPP shared-nothing architecture
Dynamic
Instantly get better performance as
hardware resources are added
Extendable
Add new compute nodes to dynamically scale out
Partitioned
No contention or upper limitation on
throughput
12
Broader, Faster, Safer: Continually increasing out-of-the-box connectivity, supported on/off Hadoop
Hadoop:
• HBase
• HDFS
• Kafka
• Hive
• Impala
• BigSQL
• Cassandra
• MongoDB
• Presto
• HortonWorks, Cloudera, MapR, EMR
Cloud:
• AWS S3
• AWS EMR/Hive
• AWS Redshift
• AWS RDS
• Azure SQL
• Azure Cloud Storage
• IBM Db2 on Cloud
• IBM Cloudant (via REST API)
• Snowflake
• Salesforce
• IBM Cloud Object Storage
Enterprise:
• Relational: Oracle, Db2, SQL Server, Informix, Sybase, MySQL, Netezza/PDA/IIAS, TD, Greenplum, Postgres, etc.
• IBM: MDM, ILOG, Streams, MQ, TM1
• Semi-structured: XML, JSON
• File: Flatfile (simple & complex), Excel
• Mainframe: Cobol, Db2z, VSAM, etc.
• APPS: SAP, SFDC, Siebel, Essbase, Peoplesoft, ORA, SAS
• Generic: ODBC, JDBC, REST, ...
• Programming: C++, JAVA
13
Utilize integrated data quality
Eliminate garbage in, garbage out reporting and analytics by implementing comprehensive and scalable data quality processing.
Cleanse
Business-driven Data
Standardization &
Matching
Assess
M/L-assisted automation
for Data Classification &
Term assignment
Discover
Discovery of Business
entities across
heterogeneous sources
Monitor & Remediate
Enterprise-wide DQ
Exception Monitoring and
collaborative remediation
Validate
Rule-based data
validation to ensure
complete & consistent
data
Life Cycle Governance
Ownership and
management of Policies
& Rules
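The "Validate" capability above applies rule-based checks for complete and consistent data. As an illustrative sketch only (the rule names, record layout, and helper functions below are invented, not Information Server's actual rule syntax), such checks might look like this:

```python
# Illustrative sketch of rule-based data validation -- not IBM's rule engine.
def is_complete(record, required):
    """Completeness rule: every required field is present and non-empty."""
    return all(record.get(f) not in (None, "") for f in required)

def is_consistent(record):
    """Consistency rule: the end date must not precede the start date."""
    return record["start"] <= record["end"]

records = [
    {"id": 1, "name": "Acme", "start": "2018-01-01", "end": "2018-12-31"},
    {"id": 2, "name": "", "start": "2018-03-01", "end": "2018-02-01"},
]

# Records failing any rule land in an exception set for remediation,
# mirroring the "Monitor & Remediate" step above.
exceptions = [
    r["id"] for r in records
    if not (is_complete(r, ["id", "name"]) and is_consistent(r))
]
```

In this toy run, record 2 fails both rules and is routed to the exception set, which is the kind of output the collaborative remediation workflow would consume.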
Built-in Governance and Data Protection
Eliminate an unmanageable data lake by implementing comprehensive data governance, including end-to-end data lineage, for business users.
Data Lineage
Full end to end data lineage
from EDW to Hadoop
Business Glossary
Establish a business
glossary - Accelerate
using Industry Models
Masking of Sensitive Data
Define Rules/Policies for
handling of sensitive data
Shop for Data
Contextual exploration of
assets and key
relationships
Classify
Automatic Classification
& Term assignment
Discover
Automatic discovery of
Metadata assets
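The "Masking of Sensitive Data" capability above is policy-driven. As a hedged conceptual sketch (the keep-last-4 rule and function name are examples chosen here, not IBM's ODPP policy format), a masking rule might behave like this:

```python
# Conceptual sketch of a data-masking rule -- not the ODPP implementation.
def mask_value(value, keep_last=4, mask_char="*"):
    """Mask all but the trailing `keep_last` characters of a sensitive value."""
    if len(value) <= keep_last:
        return mask_char * len(value)
    return mask_char * (len(value) - keep_last) + value[-keep_last:]

# A card number is readable only in its last four digits after masking.
masked = mask_value("4111111111111111")
```

The point of such a rule is that downstream consumers (reports, test data, shop-for-data previews) see a value of the right shape without exposure of the sensitive content.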
15
Information Server gives you a breadth of deployment options:
• On-premises or cloud
• Ready-made, pre-provisioned or fully customizable
Information Server grows with your data demand:
• You can start small and gradually scale your runtime without needing to change your design.
IBM Cloud, AWS & Azure
• Information Server
Hosted offerings for
instant cloud
provisioning and
management
• BYOL: Configure
IBM Information
Server as you like on
any leased Cloud
environments
Docker/Kubernetes & IBM Cloud Private
On Premise
IaaS / PaaS
• Information Server
Docker container
deployment on IBM
Cloud for instant
provisioning and
management
• Utilize Information Server capabilities through the fully engineered ICP for Data solution
• Install / deploy
Information Server
on premise and
connect to cloud and
on premise sources &
applications
Deployment & Runtime Flexibility: Build once -- Run Anywhere
IBM Information Server
16
11.3: Reducing the Platform Footprint (July 2014)
11.5: Utilizing the Power of Hadoop (September 2015)
11.7: Empowering the user through tailored design and automation (December 2017)
From task/feature oriented to user oriented
Information Server V11.7.x Release Timeline 2018
December:
• General Availability of IBM Information Server V11.7.0.0 for:
• Information Server Packages
• Information Server “a-la-carte”
2017
March / April
• Information Server
V11.7.0.0 in-place
upgrade
• Flow Designer
Enhancements
• IS on Hadoop
Enhancements
• Connectivity
enhancements
May/June
• Pack for SAP
Applications v8.1
• Information Server
V11.7 Feature Pack 1
• DataStage V11.7
Docker Container
• BigIntegrate V11.7
Docker Container
September/October
• Information Server /
DataStage Hourly
• Information Server
V11.7 Feature Pack 2
• Information Server
ISEE Docker container
2018
IBM Information Server V11.7 ... moving towards a user centric micro-service based architecture
18
Strengthen the Data Lake
Increasing speed and resilience on Hadoop
Empower the User
New Self-service / User centered experiences for
Integration and Governance
Hybrid (Cloud) Deployment
More and faster deployment options for
Information Server components
Expanding the Reach
More out of the box connectivity for Cloud, Hadoop & Enterprise
Automation & M/L
Increased automation for the Governance, Integration
and Quality process
DataStage Flow Designer– The New Integration Experience
19
Empower the User
Intuitive, browser-based (no-install) experience
– Reducing total cost of ownership
Full backwards compatibility
Accelerated productivity through:
– Automatic schema propagation
– Git Source code control Integration
– Highlighted design errors
– Powerful type-ahead search
– Server-side compilation
New since
V11.7
DataStage Flow Designer Enhancements
20
– Smart Palette: Uses M/L to arrange stages in the palette based on usage.
– Transformer: Ability to map input columns to output columns on links.
– New Stages: Amazon S3, Greenplum, Lookup, Peek and Head, Compress
– Automatic Column Propagation: Changes to column metadata, such as rename, delete, or datatype change, are automatically propagated downstream
– Load Columns: Ability to load columns from table definitions
– Rename Assets: Support rename for connections, table definitions, jobs, links and stages.
– Parameters: Ability to create, edit and delete Job parameters.
– Preview data from relational connectors using a live connection.
– Create, edit and delete connections
– Create, delete projects from DFD and assign users/roles
V11.7 FP1
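Automatic column propagation, listed above, pushes a metadata change on one stage down every downstream link. The sketch below is conceptual only: the flow structure and function are simplified stand-ins, not Flow Designer's internal model.

```python
# Conceptual sketch of automatic column propagation in a job flow.
# Each stage carries its column list and the names of its downstream stages.
flow = {
    "extract":   {"columns": ["cust_id", "amt"], "downstream": ["transform"]},
    "transform": {"columns": ["cust_id", "amt"], "downstream": ["load"]},
    "load":      {"columns": ["cust_id", "amt"], "downstream": []},
}

def propagate_rename(flow, stage, old, new):
    """Rename a column on `stage`, then repeat on every downstream stage."""
    cols = flow[stage]["columns"]
    if old in cols:
        cols[cols.index(old)] = new
        for nxt in flow[stage]["downstream"]:
            propagate_rename(flow, nxt, old, new)

# Renaming "amt" on the first stage updates the whole chain automatically.
propagate_rename(flow, "extract", "amt", "amount")
```

The design point this illustrates is that the developer edits metadata once at the source stage instead of repeating the change on every stage of the job.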
DataStage Flow Designer Git Source Code Control Support
21
– Source code control at the fingertips of the Data Engineer
– Provides synergy between DataStage and Git SCCS
– Maps a version of a job in XMETA to a version in Git
– Supports Continuous Engineering for DS artifacts
– Easily rollback to a previous version
– Supports working on multiple versions of a job
V11.7 FP1
M/L-infused DataStage Flow Designer: The more you design, the smarter it gets
Smart stage suggestion
Smart clustering
Selecting “smartness”
V11.7 FP2
DataStage Flow Designer Sequence Job Support
23
– Design Sequence Jobs in DataStage Flow Designer
V11.7 FP2
Strengthen the Data Lake
24
Faster Deployment on Hadoop
Improved Preemption Handling
Reducing the Resource
Footprint on Hadoop
Hybrid on/off Hadoop
Runtime
− Deep integration for Ambari & Cloudera Manager
− 10x accelerated deployment time
− Automatically captures all parameters for node deployment
− Instant access to IS activities on Hadoop such as CPU, Mem, Queue status
− Accurate job/error handling during container preemption
− Remembering preemption notification during container allocation
− Sending notification with diagnostic to conductor in preemption case
− Checkpointing for automatic restartability
− Utilizing Hadoop Shuffle space as Information Server Scratch space
− Simple user choice through APT Configuration file setting
− 28% reduction of the actual library size installed on the data nodes (1.8GB to 1.3GB)
− Use a single Instance of Information Server to run Hadoop and non Hadoop workload
− Optimized resource utilization for dedicated workloads against non Hadoop sources/targets
− Simple APT Configuration option
V11.7, FP1,
FP2
BigIntegrate / BigQuality: Checkpointing & Automatic Job Restart Phase 1
• Support automatic restart of jobs from scratch
• Inserts blocking checkpoint before targets
• Checkpoint buffers data in memory and disk until End of Data (EOD)
• If a job fails before EOD it will automatically be restarted
• Enabled through new variables:
• APT_CHECKPOINT_RESTART – defines number of restart attempts
• APT_CHECKPOINT_ENABLED – “targetonly” is the only supported value for Phase 1
Checkpoint
V11.7 FP1 &
FP2
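The restart semantics above (retry a failed job up to APT_CHECKPOINT_RESTART times) can be sketched conceptually. This is not the BigIntegrate implementation; the function and the simulated "preempted" job are illustrative stand-ins.

```python
# Conceptual sketch of checkpoint-restart semantics -- not IBM's engine code.
def run_with_restart(job, max_restarts):
    """Run `job` (a callable); on failure, retry up to `max_restarts` times."""
    attempts = 0
    while True:
        try:
            return job()
        except RuntimeError:
            attempts += 1
            if attempts > max_restarts:
                raise  # restart budget exhausted, surface the failure

# Simulate a job that is preempted twice before reaching end of data (EOD).
failures = {"left": 2}

def flaky_job():
    if failures["left"] > 0:
        failures["left"] -= 1
        raise RuntimeError("container preempted")
    return "end of data reached"
```

With a restart budget of 3, the simulated job succeeds on its third attempt; with a budget of 1 it would surface the failure instead, which mirrors the bounded-retry behavior the variable is described as controlling.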
Deepening the Ambari Integration
26
– Added new Ambari Widgets specific to BigIntegrate/BigQuality queue activities:
• Volume of queue including historical data
• Mem utilization
• CPU utilization
– Now includes quick links from Ambari to IS OpsConsole, DataStage Flow Designer and IS Launchpad
• Added a parameter for the IS Service Tier URL in the Ambari Config, which is used to resolve the Quick Links
– Simplified Kerberos deployment through the AmbariConsole
V11.7 FP1
Now
supported
on Cloudera
Manager
Hadoop
− New HBase connector
− Hadoop File Connector performance & security enhancements
− Kafka Connector improvements
− Hive Connector enhancements
− MongoDB support
− New Cassandra connector
− BDFS Kerberos improvements for
non Hadoop environments
− Cassandra Phase 2 in FP2 -- Data
Lineage and metadata import
− Apache Sequential File support for
File Connector
− Kafka 0.11 and 1.0 certification
− Cloudera 5.15
− HDP 3.0
Broader, Faster, Safer: Increasing Out-of-the-Box Connectivity
27
Expanding the Reach
Enterprise
− Oracle PDB and CDB support
− Siebel 8.2.2.4 certification
− Sybase datatype enhancement & IQ 16.1 support
− Security enhancement for metadata import
− New SAP BW & ERP Packs
− Data Masking ODPP v11.3 support and expanded Data masking policy support
− DTS Connector: MQ Client mode
− MQ Connector version update
− ILOG Connector Decision Engine
− Db2 v12 z/OS certification
− Greenplum v5.4 certification
− TD Big Buffer Support and TD 16.2 certification
− New SAP ERP Pack V8.1
− Sybase IQ 16.1
− Db2 connector support for External Tables
− RJUT usability improvements
Cloud
− Amazon EMR/Hive
− Amazon Redshift
− New Snowflake connector
− New Azure Cloud Storage connector
− SFDC API 42 support
− Amazon S3 KMS Support
− Amazon S3 Parquet and ORC support
− New IBM Cloud Object Storage connector
V11.7, FP1
& FP2
Some views on the new IBM Cloud Object Storage Connectivity
28
– Supported as source and target
– Supports CSV, Parquet, Avro, JSON and Excel files
– Can write & delete files
– Can read single/multiple files using wildcards or regular expressions
V11.7 FP2
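The wildcard and regex file selection mentioned above can be sketched with Python's standard library. Note this is a conceptual illustration only: the object keys and helper functions are invented, not the connector's actual API.

```python
import fnmatch
import re

# Hypothetical object keys, as they might appear in a Cloud Object Storage bucket.
object_keys = [
    "sales/2018/jan.parquet",
    "sales/2018/feb.parquet",
    "sales/2017/dec.parquet",
    "logs/app.json",
]

def select_by_wildcard(keys, pattern):
    """Select keys using shell-style wildcards ('*', '?')."""
    return [k for k in keys if fnmatch.fnmatch(k, pattern)]

def select_by_regex(keys, pattern):
    """Select keys using a regular expression (full match)."""
    rx = re.compile(pattern)
    return [k for k in keys if rx.fullmatch(k)]

wild = select_by_wildcard(object_keys, "sales/2018/*.parquet")
regx = select_by_regex(object_keys, r"sales/\d{4}/(jan|feb)\.parquet")
```

Wildcards are convenient for simple prefix/suffix selection, while regular expressions allow more precise multi-file selection, which is the distinction the bullet above draws.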
Information Server Docker Container-based Deployment
• Docker / Kubernetes deployment available for:
• DataStage
• BigIntegrate
• Information Server Enterprise Edition
• Be ready to use an instance in as little as 20 minutes
• Providing you with horizontal scaling and self-healing:
• Instantly restart a container if it fails
• Expand / shrink your engine tier nodes if you need more/less capacity
V11.7 FP1 &
FP2
DataStage BigIntegrate
Enterprise Edition
IBM INFORMATION INTEGRATION
Vision & Strategy Update
Compose
Enable the platform as loosely coupled service for fast & easy deployment
Automate
Infuse data science and machine learning into everything we do
Multi Cloud
Flexible cloud deployment and optimized workload
Simplify
Make products accessible and easily consumable
31
Development driven by Key Priorities
32
Unified Governance and Integration Platform: A service-based architecture underpinned by a common Metadata & Governance foundation
Core Unified Governance & Data Integration Services
API
StoredIQ Apache Atlas IBM Cloud Private for Data Watson Studio
Ref. Data Mgt
Consent Management
Auto Data Classification
Auto Data Quality
Self-Service Data Preparation
Data Integration Collaboration
Auto Term Assign
Policy Management
Policy Enforcement
Open Workflow
Auto Data Discovery
Shop4info
Entity Resolution OMRS
Information Server
Powered By
Open Metadata Services
Smart Metadata
Knowledge
Graph
Contextual
Business Metadata
Operational
Technical Metadata
Social
Lineage
IGC
ML & Industry Context Infused
APIs
Data Sources
Systems of Insights
Cloud
Hadoop
Social Media
News
MDM
Documents
Other External
Systems of Engagement
Systems of Record
Data Engineer
Business Users
Data Scientist
Data Quality Analyst
Governance Officers
Data Steward
CDO
Personalized User Experience
33
PX Spark
Batch
Real-time
Event-driven
Interactive Personalized Experience
Shape &
Curate
Pattern & ML driven flow builder
Comprehensive Flow Design
Open API
Projects
Services
Operations & Administration
Built-in Governance & M/L
Micro-services
User experience adapting to users' needs across the enterprise --> NOT the user adapting to the experience
Any user leverages the same enterprise-ready foundation
Adaptable Integration Experiences
34
Accelerating Data Integration through M/L-based automation
Machine Learning assisted Data Integration
• Shop for Data Integration
• M/L supported auto generation of Integration Flows
• Data Movement Strategy
• Heuristic Job Optimization across multi cloud / multi engine
• M/L Job Design Optimization
• Smart User Interface
• M/L based stage suggestions
• M/L based smart palette
• M/L based asset organizer
Multi Cloud Optimization: Customers are operating across environments in multiple clouds.
• Ad-hoc service provisioning anywhere
• Runtime/Deployment elasticity
• Dynamically expand/shrink
capacity based on workload
requirements and data location
• Seamless interoperability between IBM's
private & public cloud integration
services
• Flexible licensing (metered or fixed)
Information Server -- Data Integration Release Plan
Q4 / 2018
Delivery method: Patches
Platform: Information Server (Docker) on AWS and Azure
Connectivity
• Presto certification (Simba driver) -- Done!
• Azure ADLS (stretch goal)
• DynamoDB certification (Simba driver)
• Google FS
• Google BigQuery
• Hive Update/Delete
• AWS Aurora (Postgres)
• Zookeeper for Hive support (new DD Driver certification)
• SFDC Bulk load with PK chunking
DFD
• Support for Table Definitions Creation
• Information Server on Cloud Managed
• Support for BitBucket in addition to GitHub
• Parameter Set support
• Expose more Job properties
• Show execution metrics
• More stages -- e.g. Java, Hierarchical
• Support for Reject links for Lookup & Connectors
Engine / Hadoop
• IS on Hadoop HDFS Scratch Disk performance improvements
• View of Hadoop log files from a different user id
• BigIntegrate -- resource optimization for lookup and transformer
• DataStage on Spark Technology Preview
Information Server Release Plan 2019
1H:
• Planned new Release for
Information Server with focus on
M/L and Automation
• MVP Business User driven data
preparation & curation
• DataStage runtime on Spark
• Agent Stella
• DataStage / ISEE managed offerings
on AWS & Azure
2H:
• Multi-cloud runtime optimization and
automatic runtime selection -- Data
Movement strategy
• Machine-learning based automatic flow
generation (FT 2.0)
2019
38