1
BUS05 The Evolution of Data Integration
John Motler
Principal Sales Consultant
Informatica
2
Application Database Partner Data
SWIFT NACHA HIPAA …
Cloud Computing Unstructured
Data Warehouse
Data Migration
Test Data Management & Archiving
Master Data Management
Data Synchronization
B2B Data
Exchange
Data
Consolidation
What is Data Integration
Integration Platform
3
Breadth of Data Ever Increasing
WebSphere MQ JMS MSMQ SAP NetWeaver XI
JD Edwards Lotus Notes Oracle E-Business PeopleSoft
Oracle DB2 UDB DB2/400 SQL Server Sybase
ADABAS Datacom DB2 IDMS IMS
Word, Excel PDF StarOffice WordPerfect Email (POP, IMAP) HTTP
Informix Teradata Netezza ODBC JDBC
VSAM C-ISAM Binary Flat Files Tape Formats…
Web Services TIBCO webMethods
SAP NetWeaver SAP NetWeaver BI SAS Siebel
Messaging, and Web Services
Packaged Applications
Relational and Flat Files
Mainframe and Midrange
Unstructured Data and Files Flat files
ASCII reports HTML RPG ANSI LDAP
Industry Standards EDI–X12
EDIFACT
RosettaNet
HL7
HIPAA
ebXML
HL7 v3.0
ACORD (AL3, XML)
XML Standards XML
LegalXML
IFX
cXML
AST
FIX
Cargo IMP
MVR
Salesforce CRM
Force.com
RightNow
NetSuite
SaaS/BPO ADP Hewitt SAP By Design Oracle OnDemand
4
Evolution of Data Integration
5
1960s and 1970s
Databases and Applications
6
The Database
• 1960s Network / Hierarchical Databases
• 1970s Relational Databases
• IBM – E.F. Codd - System R - SQL – 1974
• DB2 – 1983 mainframe
• Ingres – 1973 (first product 1979)
• Oracle – 1978 first release
• Sybase – 1980s
• MS SQL Server – 1990s
• Others
• Integrated hardware and software – Teradata / System 38
• Object Oriented Databases
• NoSQL databases
7
1980s
Data Integration, Data Warehouses, ETL
8
Early Application/Data Integration
Sales
Payroll
Cust.
Service
Shipping
Extract
& Split
Join
& Load
• Integration needs increased with
the increase in repositories
• Tools emerged to generate code
that pulled and pushed data
• Mainframe data used COBOL
scripts; open systems used C
• Scripts to transfer data proliferated
• The approach became known as ETL,
but growth of the tools was driven
by the emergence of ?????
9
Data Warehouse
• 1970s: Bill Inmon defines the term Data Warehouse
• 1983: Teradata releases a decision support DBMS
• 1990: Red Brick (Ralph Kimball) released
• 1990s: Informatica releases PowerMart, a GUI-based
ETL tool
10
ETL Capabilities
• Adapters
• Graphical development
environment
• Transformation library:
Joining tables & files
Pivoting for normalization
Aggregating
Slowly Changing Dims.
Lookups
Parsing
Expressions
• Metadata architecture
• Object principles
• Performance tuning
• High availability
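The transformation library above can be illustrated with a small sketch. This is not Informatica code; it is a plain-Python stand-in showing the join, lookup, and aggregate operations a graphical ETL tool generates behind the scenes. All table names and data are invented for the example.

```python
# Minimal ETL sketch: extract two record sets, apply a lookup
# transformation (cust_id -> customer name), and aggregate amounts
# per customer -- typical transformations from an ETL library.

orders = [  # rows "extracted" from an orders source
    {"order_id": 1, "cust_id": 10, "amount": 250.0},
    {"order_id": 2, "cust_id": 10, "amount": 100.0},
    {"order_id": 3, "cust_id": 20, "amount": 75.0},
]
customers = {10: "Acme Corp", 20: "Globex"}  # lookup table

def transform(orders, customers):
    """Join orders to the customer lookup, then aggregate by customer."""
    totals = {}
    for o in orders:
        name = customers.get(o["cust_id"], "UNKNOWN")   # lookup
        totals[name] = totals.get(name, 0.0) + o["amount"]  # aggregate
    return totals

loaded = transform(orders, customers)  # the "load" step would write this to the warehouse
```

A real tool would express the same flow as mappings in a graphical designer, with the generated code handling the extract and load against actual sources.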
11
1990s
Data Quality and EAI
12
Data Quality
• Emerging Need for Data Quality
• Many data warehouse and data migration projects failed
because of quality issues
• Data warehouses were used to make business decisions with incorrect data; they were not the cause of the bad data but became the focus
• Government led the way: National Change of Address (NCOA) provided postal data to avoid the huge cost of wasted postage
• Initially service based, but expanded to server offerings
• Name and address data was the initial focus, but this expanded to all data types and domains
• Data quality tools were added to ETL capabilities
13
Data Quality Capabilities
• Data Profiling
• Initially assessing the data to understand its quality challenges
• Data Standardization
• A rules engine that ensures that data conforms to quality rules
• Address Validation and Geocoding
• For name and address data. Corrects data to US and Worldwide postal standards
• Matching or Linking
• A way to compare data so that similar, but slightly different records can be aligned. Matching may use "fuzzy logic" to find duplicates in the data.
• Monitoring
• Keeping track of data quality over time and reporting variations in the quality of data.
• Batch and Real-Time
• Once the data is initially cleansed (batch), companies often want to build the same
processes into enterprise applications to keep it clean
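The matching/linking capability above can be sketched with the standard library. This is only an illustration of "fuzzy" comparison, not a vendor algorithm; the records and the 0.8 threshold are invented, and real tools use far more sophisticated scoring across multiple fields.

```python
# Illustrative fuzzy-matching pass: pairwise compare name fields and
# flag pairs whose similarity ratio exceeds a threshold, so that
# slightly different records ("John Smith" / "Jon Smith") are linked.
from difflib import SequenceMatcher

records = [
    ("1", "John Smith", "12 Main St"),
    ("2", "Jon Smith",  "12 Main Street"),
    ("3", "Mary Jones", "4 Elm Ave"),
]

def similarity(a, b):
    """Similarity in [0, 1] between two strings, case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_duplicates(records, threshold=0.8):
    """Compare every pair of records on the name field."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i][1], records[j][1]) >= threshold:
                pairs.append((records[i][0], records[j][0]))
    return pairs

dupes = find_duplicates(records)
```

In a production data quality tool, this comparison would run after standardization, so that formatting differences don't mask true matches.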
14
Enterprise Application Integration
Integration framework composed of a collection of
technologies and services which form a middleware to
enable integration of systems and applications across the
enterprise
• Extends from just data to business data and process
• Topologies
• Hub-and-Spoke – central message broker
• Bus – central messaging backbone
• Emergence of EAI offerings
• IBM WebSphere MQ, TIBCO, Vitria, SeeBeyond, webMethods
• Early promise but many failures
• Constant change, competing standards, conflicting requirements,
and the scale of initiatives
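The hub-and-spoke topology can be sketched in a few lines. This toy broker is an invented illustration, not any vendor's product: applications connect only to the hub, which routes messages by topic, so n applications need n connections rather than n*(n-1) point-to-point links.

```python
# Toy hub-and-spoke message broker: subscribers register per topic,
# and the hub fans each published message out to them.

class Hub:
    def __init__(self):
        self.subscribers = {}  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)

    def publish(self, topic, message):
        # Route the message to every application subscribed to the topic.
        for cb in self.subscribers.get(topic, []):
            cb(message)

hub = Hub()
received = []
hub.subscribe("orders", lambda m: received.append(("billing", m)))
hub.subscribe("orders", lambda m: received.append(("shipping", m)))
hub.publish("orders", {"order_id": 42})
```

A bus topology differs mainly in deployment: routing logic moves out of a central broker onto a shared messaging backbone, but the publish/subscribe contract looks much the same to the applications.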
15
2000s
XML & Web Services, Data Exchanges, ESB, Data Federation,
Real-time and MDM
16
XML
• eXtensible Markup Language
• A set of rules for encoding documents in a format that is both
human-readable and machine-readable
• 1.0: initial release 1998, fifth edition 2008; 1.1: initial release 2004 (limited use)
• Importance
• Drives a lot of B2B communications
• Many standard XML formats
• US-based standard for EDI – ASC X12 (finance, healthcare, insurance
exchanges)
• In conjunction with SOAP and HTTP, forms the backbone for the next key
integration technology (?)
• Criticism
• Verbose and complex, hence the emergence of competing data
interchange standards (JSON)
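The verbosity criticism is easy to see side by side. Below, one invented customer record is encoded both ways and parsed with standard-library tools; both forms are human- and machine-readable, but the XML form repeats every element name.

```python
# The same record as XML and as JSON, parsed back to the same dict.
import json
import xml.etree.ElementTree as ET

xml_doc = "<customer><id>10</id><name>Acme Corp</name></customer>"
json_doc = '{"customer": {"id": 10, "name": "Acme Corp"}}'

root = ET.fromstring(xml_doc)
from_xml = {"id": int(root.findtext("id")), "name": root.findtext("name")}
from_json = json.loads(json_doc)["customer"]
```

Both decode to the same data, while the XML encoding is noticeably longer; at B2B message volumes that overhead is one reason JSON gained ground for lighter-weight interchange.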
17
Web Services and SOA
• Key Concepts
• Less data integration and more business service or process
integration
• Loosely coupled services that each provide a single action or business function
• Multiple web services are orchestrated to provide application functionality
• Services should be discoverable and have a simple interface
• Key Principles
• Reuse, granularity, modularity, componentization and
interoperability
• Standards compliance (both common and industry-specific)
• Service identification and categorization, provisioning and delivery, and monitoring and tracking
• Web services alone as SOA cannot handle the complex, secure, SLA-based applications of an enterprise.
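The concepts above (loosely coupled, discoverable services orchestrated into a business process) can be sketched without any web-service plumbing. The registry, service names, and order logic here are all invented; a real SOA would expose each service over SOAP or HTTP and discover it via a directory such as UDDI.

```python
# SOA-style orchestration sketch: each service is a small function
# registered under a name so it can be discovered; an orchestrator
# composes discovered services into one business process.

registry = {}

def service(name):
    """Decorator that registers a function as a discoverable service."""
    def wrap(fn):
        registry[name] = fn
        return fn
    return wrap

@service("check_credit")
def check_credit(order):
    return order["amount"] <= 500  # invented credit rule

@service("reserve_stock")
def reserve_stock(order):
    return {"order_id": order["order_id"], "reserved": True}

def place_order(order):
    """Orchestrate the discovered services into an order flow."""
    if not registry["check_credit"](order):
        return {"order_id": order["order_id"], "reserved": False}
    return registry["reserve_stock"](order)

result = place_order({"order_id": 7, "amount": 120})
```

Because the orchestrator looks services up by name rather than calling them directly, either service can be replaced or redeployed without touching the flow, which is the loose coupling the slide describes.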
18
B2B Data Exchanges and EDI
External Partners Internal Systems
B2B Data Exchange
Data Integration
Monitoring
Partner Management
Data Transformation
Managed File Transfer
“Computer-to-computer interchange of strictly formatted messages. EDI
implies a sequence of messages between two parties, either of whom may
serve as originator or recipient. The formatted data representing the
documents may be transmitted from originator to recipient via
telecommunications or physically transported on electronic storage media.”
19
Data Exchange Functional Architecture
Monitoring
Data Exchange
Events
Reprocessing
Searching
Archiving
Data Transformation
Industry
Standards
Excel Mapping
Specification
Universal Data
Transformation
Visual Mapping
Design
Data Flow Graphical Flow
Design
Logging
Transformation
Reconciliation
Routing
Managed File Transfer Multi-Protocol
support
Encryption
Pre Configured
Connections
Certificate
Management
Reporting
Security
Trading Partners
Mgmt. Onboarding
Profiles
Store & Forward
Endpoints
management
Scheduling
Dashboards
Validations
20
Enterprise Service Buses
• Key Concepts
• Backbone to support SOA architectures
• Supports move to service integration
• Provides connectivity, application adapters, rule based routing of messages, and limited data transformation
• Newer wave of message based EAI
21
Data Federation or Virtualization
• Key Concepts
• Query data but don’t physically move it
• Supported early need for “web presence”
• Changing latency requirements for data and reports
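A federated query can be sketched as follows. The two "sources" here are in-memory stand-ins for real remote systems, and all data is invented; the point is that the view pulls only the rows needed at query time and joins them in memory, rather than physically copying either source into a warehouse.

```python
# Federation sketch: build a virtual view over two sources without
# moving or persisting the underlying data.

crm = [  # stand-in for a remote CRM system
    {"cust_id": 10, "name": "Acme Corp"},
    {"cust_id": 20, "name": "Globex"},
]
billing = [  # stand-in for a remote billing system
    {"cust_id": 10, "balance": 350.0},
    {"cust_id": 20, "balance": 0.0},
]

def federated_lookup(cust_id):
    """Answer one query by fetching matching rows from each source."""
    name = next(r["name"] for r in crm if r["cust_id"] == cust_id)
    balance = next(r["balance"] for r in billing if r["cust_id"] == cust_id)
    return {"cust_id": cust_id, "name": name, "balance": balance}

view = federated_lookup(10)
```

The trade-off is latency and source load: every query touches the live systems, which is why federation suits low-volume, freshness-sensitive access rather than heavy analytics.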
22
Real-time Integration
• Requirements
• Support need for decreasing latency for reporting and business intelligence
• Application synchronization
• Approaches
• Data Federation one approach
• Change Data Capture
• identification, capture and delivery of the changes made to enterprise data sources.
• Allows transformation of data
• Data Replication technology
• Copy of production data or subsets
• Used mostly for reliability or fault tolerance
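Change data capture can be sketched with a snapshot diff. Real CDC products usually read the database transaction log rather than comparing snapshots, and the tables below are invented; the sketch only shows the output CDC produces: the inserts, updates, and deletes needed to bring a target in sync.

```python
# Snapshot-diff CDC sketch: compare last-seen state with current
# state and emit only the changes, keyed by primary key.

def capture_changes(old, new):
    """Return (operation, key, row) tuples describing the delta."""
    changes = []
    for key, row in new.items():
        if key not in old:
            changes.append(("insert", key, row))
        elif old[key] != row:
            changes.append(("update", key, row))
    for key in old:
        if key not in new:
            changes.append(("delete", key, None))
    return changes

old = {1: {"status": "open"}, 2: {"status": "open"}}
new = {1: {"status": "closed"}, 3: {"status": "open"}}
delta = capture_changes(old, new)
```

Shipping only this delta, instead of reloading whole tables, is what lets CDC feed reporting systems at low latency.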
23
Master Data Management
Application Legacy Cloud Computing Unstructured
Eligibility Provider
Eligibility Client
Third Party Data
Benefits Client
Benefits Provider
Eligibility Provider
Eligibility Client
Benefits Benefits
Client Provider
MMIS Other:
CHIP, TANF, etc.
Data
Governance
?
No Single
View of
Master Data
Statewide Automated Welfare Systems
24
“a set of processes and tools that consistently defines and manages
the non-transactional data entities of an organization (which may
include reference data).
MDM provides processes for collecting, aggregating, matching,
consolidating, quality-assuring, persisting and distributing such data
throughout an organization to ensure consistency and control
in the ongoing maintenance and application use of this information.”
Master Data Management
25
Master Data Management
Master Data Management
Master Data Foundation
Data Profiling Data Quality Data
Integration
Data Services Third Party Data
Applications
Relational and Flat Files
Legacy
Dashboard
Business Intelligence
Analytical
Operational
Applications
Legacy
Data Integration
Data Integration
Recognize
Resolve
Relate
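The "Recognize / Resolve" steps above can be sketched briefly. The records, the normalized-name match key, and the keep-the-longest-value survivorship rule are all invented for illustration; real MDM hubs use configurable matching and survivorship rules across many attributes.

```python
# MDM sketch: recognize that records from different systems describe
# the same person (match on a normalized key) and resolve them into
# one golden record, preferring the most complete value per field.

def normalize(name):
    """Collapse case and whitespace to form a simple match key."""
    return " ".join(name.lower().split())

def build_golden_records(sources):
    masters = {}
    for rec in sources:
        key = normalize(rec["name"])            # recognize
        golden = masters.setdefault(key, {})
        for field, value in rec.items():        # resolve (survivorship)
            if value and len(str(value)) > len(str(golden.get(field, ""))):
                golden[field] = value
    return masters

sources = [
    {"name": "John  Smith", "phone": "", "email": "js@example.com"},
    {"name": "john smith", "phone": "555-0100", "email": ""},
]
golden = build_golden_records(sources)
```

The two source records collapse into a single master entry that combines the phone number from one system with the email address from the other, which is the "single view" the slide's diagram says is missing.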
26
2010s
Cloud, Big Data
27
Cloud Computing
Cloud computing is the use of computing
resources (hardware and software) which
are available in a remote location and
accessible over a network (typically the
Internet).
Cloud computing entrusts remote services
with a user's data, software and
computation
28
Big Data
Online
Transaction
Processing
(OLTP)
Online Analytical
Processing
(OLAP) &
DW Appliances
Social
Media Data
Device
Sensor Data
Scientific, genomic
Machine/Device
BIG TRANSACTION DATA BIG INTERACTION DATA
BIG DATA PROCESSING
Call detail
records, image,
click stream data
29
Government Use Cases
30
Health Insurance Exchange
• Background
• Affordable Care Act (“ObamaCare”) - States need to offer central eligibility and enrollment services for both State run and commercial health care plans
• HIE Integration Challenges
• State welfare systems and other eligibility systems
• State healthcare systems MMIS
• Federal Hub, providing eligibility and income services
• Commercial healthcare providers (qualified health plans)
31
MMIS
Health Insurance Exchange
Federal Hub
SAWS
Payers
• Database
• Multiple data sources from multiple agencies
• ETL
• Data for current eligibility and reporting
• Data Quality
• Cleansing and standardizing disparate data
• XML / Web Services
• Integration with the Federal Hub
• Data Exchanges
• X12 – HIPAA transactions between payers and state
• ESB / Real-time
• Eligibility determination, messages based deployments
• Data Federation
• Deployments that don’t centralize data
• MDM
• Linking people stored in different systems
32
Statewide Longitudinal Data Systems
SLDS LINK:
Data is required to be shared and exchanged across multiple agencies
(human services, K-12, higher education, labor, corrections) and levels
(district, state, federal) to promote accountability, inform policy and
ensure a holistic view of student success.
What types of
preschool
programs are
most effective?
How well are
we preparing
students for
college?
Are college
students taking
remedial
classes?
Are college
graduates
prepared to
enter the
workforce?
33
Statewide Longitudinal Data Systems
• Database
• Multiple data sources from multiple agencies
• ETL
• Demographic data for students in multiple systems
• Data Quality
• Cleansing and standardizing disparate data
• Data Federation
• Access to data from corrections and other agencies to support policy and summarization without PII
• MDM
• Linking people stored in different systems
34
CO SLDS – VIDEO
35
Summary
• Need to define an enterprise integration strategy
• Need to incorporate message, process and data integration
architectures
• It will change so abstraction and modularity are important
• Define an SOA strategy, but it will be some time before this is a
universal data integration strategy
• MDM, in concert with Data Integration and Data Quality, is
integral to many Government data sharing initiatives
36