The Dirty Work -- Why Data Must Be Reconciled
DESCRIPTION
The Briefing Room with Eric Kavanagh and the PSI-KORS Institute
Live Webcast Nov. 12, 2013
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?AT=pb&SP=EC&rID=7727087&rKey=66b1fa7d82868199
Let's face it -- most enterprise information systems are a mess. That's often due to grunt work which was overlooked months or years ago and had nothing to do with you, except that you inherited it. Some mistakes can be swept under the rug for a while, but sooner or later, garbage in results in very expensive garbage out. Register for this episode of the Briefing Room to hear Senior Analyst Eric Kavanagh outline a roadmap from the past into the possible futures of the information economy. He'll be briefed by Dr. Geoffrey Malafsky, Founder and Data Scientist for the PSI-KORS Institute, a new organization focused on data reconciliation. Malafsky will share his institute's methodology and explain how the process of doing the dirty work can yield tremendous benefits. Visit InsideAnalysis.com for more information.
TRANSCRIPT
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
The Dirty Work – Why Data Must Be Reconciled
Twitter Tag: #briefr
Welcome
Host & Analyst: Eric Kavanagh
Guest: Geoffrey Malafsky
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Data Reconciliation
[Figure: Garbage In Garbage Out (GIGO). Garbage data + perfect model → garbage results; perfect data + garbage model → garbage results.]
§ Current data is disjointed and of low quality
§ Variable use and meaning among systems even for “same” data elements
§ Undocumented definitions and data mgmt processes
§ Errors in data systems
§ Disagreement among data systems
§ Lack of existing descriptions for key readiness use cases
§ Legacy data systems have failed to overcome these problems despite several years of new marts/houses/brokers/IPTs/applications
1. Wall Street Journal, CIO's Big Problem with Big Data, 2012-08-02
2. Forbes, The CEO/CMO Dilemma: So Much Data, So Little Impact, 2012-07-18
“Many CIOs believe data is inexpensive because storage has become inexpensive. But data is inherently messy – it can be wrong, it can be duplicative, and it can be irrelevant – which means it requires handling, which is where the real expenses come in. ‘The cost of more data is the application and the computing power and the processes to reconcile all these things.’”
“While there are a myriad of analytical tools that can be leveraged, a recent study indicated that more than 70% of CMOs feel they are underprepared to manage the explosion of data and ‘lack true insight.’”
§ Suffix in source A, prefix in B, neither in C for same (part number, title, …)?
§ Conflict syntactically (simplest case) and semantically (most difficult)
§ Other tools & methods never solve this because they deal with the obstacles independently or not at all: Data values out-of-sync with metadata, data models
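The syntactic case named above (suffix in source A, prefix in B, neither in C) can be illustrated with a minimal sketch. The affix values `-US` and `PN-` and the rule table are hypothetical, invented only for illustration; the slides do not specify them. The sketch covers only the simplest, syntactic conflict and does nothing for the semantic case.

```python
# Hypothetical per-source affix rules; the real conventions would have to be
# discovered and documented for each system.
SOURCE_RULES = {
    "A": {"suffix": "-US"},  # assumed: source A appends a suffix
    "B": {"prefix": "PN-"},  # assumed: source B prepends a prefix
    "C": {},                 # source C stores the bare value
}


def normalize_part_number(source: str, raw: str) -> str:
    """Strip the source-specific affix and return one canonical value."""
    rule = SOURCE_RULES[source]
    value = raw.strip().upper()
    prefix = rule.get("prefix")
    suffix = rule.get("suffix")
    if prefix and value.startswith(prefix):
        value = value[len(prefix):]
    if suffix and value.endswith(suffix):
        value = value[:-len(suffix)]
    return value


# The same part as it might appear in three systems.
records = [("A", "12345-US"), ("B", "pn-12345"), ("C", "12345")]
canonical = {normalize_part_number(src, raw) for src, raw in records}
```

After normalization the three spellings collapse to a single value. Whether that value means the same thing in each system is the semantic question, which string handling alone cannot answer.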
Copyright Phasic Systems Inc 2013
NKY HomeSeekers Texas Different Meanings (Legal and Business Activities)
1. Create table – title aligned to business = Garage
2. Create vocabulary: spaces.description, spaces.national, spaces.state, ...
3. Define ETL logic
4. Merge in warehouse and process in virtualization layer
5. Change as needed
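The five steps above can be sketched roughly as follows. The vocabulary terms come from the slide; every source column name (`GarageDesc`, `garage_text`, `GAR_DESC`, and so on) and the sample values are invented for illustration, and the in-memory list stands in for the warehouse and virtualization layer, which the slides do not describe in detail.

```python
from typing import Dict, List, Optional

# Step 2: the common vocabulary (terms taken from the slide).
VOCABULARY = ("spaces.description", "spaces.national", "spaces.state")

# Step 3: ETL logic as a per-source rename table.
# All source column names are hypothetical.
SOURCE_TO_VOCAB: Dict[str, Dict[str, str]] = {
    "NKY": {"GarageDesc": "spaces.description", "GarageType": "spaces.state"},
    "HomeSeekers": {"garage_text": "spaces.description", "garage_code": "spaces.national"},
    "Texas": {"GAR_DESC": "spaces.description", "GAR_ST": "spaces.state"},
}

Row = Dict[str, Optional[str]]


def to_garage_row(source: str, record: Dict[str, str]) -> Row:
    """Rename one source record's columns to the vocabulary terms."""
    mapping = SOURCE_TO_VOCAB[source]
    row: Row = {term: None for term in VOCABULARY}
    row["source"] = source  # keep lineage so a mapping can be revisited later
    for column, value in record.items():
        term = mapping.get(column)
        if term is not None:
            row[term] = value
    return row


def merge(batches: Dict[str, List[Dict[str, str]]]) -> List[Row]:
    """Step 4: stack the renamed rows into one Garage table (step 1)."""
    return [to_garage_row(src, rec) for src, recs in batches.items() for rec in recs]


garage_table = merge({
    "NKY": [{"GarageDesc": "2 car attached", "GarageType": "ATT"}],
    "Texas": [{"GAR_DESC": "Carport", "GAR_ST": "CP"}],
})
```

Step 5, "change as needed", then amounts to editing `SOURCE_TO_VOCAB` rather than rewriting the merge logic, which is one plausible reading of why the mapping is kept as data.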
§ Data Rationalization is the process of building and managing a continuously adaptive data environment that fuels current and future business needs for decision making and system operations
§ It ensures data (i.e. not just metadata) is as accurate, meaningful, and useful as possible while continuously adjusting to improve and add capability
§ It provides collaborative management of data assets, the designs governing who, why, and how of data, and the where, when, how of data use in operational systems
§ It solves the great challenge of mapping all source values to each target along the entire complex paths of enterprise data use
§ Consolidated values when possible with continuous improvement
§ Simplified and adaptive mapping with Corporate NoSQL
Design Rationalization Issues
• Multiple data models
• Conflicting definitions
• Similar, supposedly similar, operationally distinct values
• Unknown business logic
• Multiple ETL mappings
System Rationalization Issues
• Multiple database systems
• Conflicting formats
• Redundant storage
• Unsynchronized values
• Multiple integration points
Design Rationalization
• Consolidated, adaptive data models
• Standardized definitions
• Synchronized distinct operational values
• Managed business logic
• Coordinated ETL mappings
System Rationalization
• Consolidated, adaptive systems
• Common, interoperable formats
• Common storage
• Synchronized interfaces
• Coordinated integration
Rationalized Data = Meaningful Analysis, Decision Support, Enterprise Applications
§ Example from DARPA Evidence Extraction & Link Discovery
§ Today’s Situation: ~10k messages/day from multiple sources read by multiple analysts and analyzed in multiple manual non-integrated tools
§ Similar to Social Network Analysis
Complicated Mixture of Commercial, Custom, Legacy, Services Applications, Data Stores
Costs
Business Alignment: Goal, Capability, Architecture
Data Assets: Systems, Owners, Use
The Ψ–KORS™ System Model
Point-select data models, codes, entities
Corporate NoSQL™
§ DOD CIO
  § Adaptively blend financial and program data from multiple sources with unclear, undocumented alignment and integration logic (i.e. this is an intelligence challenge) into BI tools (QlikView, Tableau, Pentaho, Excel Web Apps-SharePoint)
§ Export Development Canada
  § Rationalize core data distributed and undocumented to feed cross-enterprise governance and develop Enterprise Data Model with semantically adjudicated canonical entities
§ Challenge: Complicated environment with conflicting data values, standards, business use cases, and lack of documentation. Data owned by 4 major organizations, in multiple warehouses and data stores, with redundant non-reconciled sets of data
§ Requirement: Integrated, common, accurate data to enable new integrated workforce planning, training, management application (“Sailor of the Future”) for 1 million people
§ Prior Activities: 10+ years of system integration, data warehouse, data governance efforts → no improvement, poor coordination across organizations and systems
§ Yet, there were problems with the most basic data fields, which, for the Navy, include things like
  § billet (effectively a job but also includes other characteristics),
  § rank (similar to seniority but with formal rules that change over time),
  § rating (similar to vocational ability but also with changing rules),
  § and even the primary identifier of a person, the Social Security Number (SSN).
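A first step toward reconciling such fields is simply finding where systems disagree. The sketch below is a minimal illustration, not the institute's method: the system names, the `person_id` key, and all sample values are hypothetical. It also assumes a reliable person key, which the slide warns may not hold in practice, since even the SSN had problems.

```python
from collections import defaultdict

# Basic fields named on the slide.
FIELDS = ("billet", "rank", "rating")


def find_conflicts(records):
    """Group records by person and report fields whose values disagree."""
    seen = defaultdict(lambda: defaultdict(set))
    for rec in records:
        for field in FIELDS:
            value = rec.get(field)
            if value is not None:
                seen[rec["person_id"]][field].add((rec["system"], value))
    conflicts = {}
    for person, fields in seen.items():
        for field, pairs in fields.items():
            # More than one distinct value across systems is a conflict.
            if len({value for _, value in pairs}) > 1:
                conflicts[(person, field)] = sorted(pairs)
    return conflicts


# Hypothetical records from two hypothetical warehouses.
records = [
    {"system": "warehouse_1", "person_id": "P001", "rank": "E-5", "rating": "IT"},
    {"system": "warehouse_2", "person_id": "P001", "rank": "E-6", "rating": "IT"},
    {"system": "warehouse_2", "person_id": "P002", "rank": "O-3", "billet": "B-17"},
]
conflicts = find_conflicts(records)
```

Here only the rank of `P001` is flagged. Deciding which value is correct, or whether both are valid under rules that changed over time, remains a business decision rather than something the code can settle.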
§ Bridge Organizations, Processes, Technologies to Data Concepts
Logical Models derive directly from conceptual and use business terms
• Promulgate key technologies to help field overcome major obstacles
• Identify cause and existence of semantic conflicts
• Determine options
• Promote enterprise decision making on solution
• Implement solution into operational data
• Visible direct line from governance to data modeling to integration to database engineering to analysis and back again
• Rapid cycle time: identify, assess, decide, execute continuously in natural organizational timeline (days/weeks)
• Community version DataStar for non-commercial use
• Collaborative community communication and design of common, semantically clear Corporate NoSQL models
Upcoming Topics
www.insideanalysis.com
November: DATA DISCOVERY & VISUALIZATION
December: INNOVATORS
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Thank You for Your Attention