Trends in Data Modeling Presented by James Michael Lee and Peter Aiken, Ph.D.



Welcome: Trends in Data Modeling

Date: August 11, 2015
Time: 2:00 PM ET
Presented by: Peter Aiken, Ph.D.; Steven MacLauchlan; Michael Lee

Copyright 2015 by Data Blueprint

Businesses cannot compete without data. Every organization produces and consumes it. Data trends are hitting the mainstream, and businesses are adopting buzzwords such as Big Data, Data Vault, and Data Scientist to seek solutions to their fundamental data issues. Few realize that the value of any solution, regardless of platform or technology, relies on the data model supporting it. Data modeling is not an optional task for an organization’s data remediation effort. Instead, it is a vital activity that supports the solution driving your business. This webinar will address emerging trends around data model application technology, as well as trends around the practice of data modeling itself. We will discuss abstract models and entity frameworks, as well as the general shift from data modeling being segmented to becoming more integrated with business practices.

Takeaways:
• How are NoSQL, Data Vault, etc. different, and when should I apply them?
• How Data Modeling relates to business process
• Application development (data first, code first, object first?)

Steven MacLauchlan

• 10 years of experience in Application Development and Data Modeling with a focus on Healthcare solutions.
• Delivers tailored data management solutions that provide focus on data’s business value while enhancing clients’ overall capability to manage data
• Certified Data Management Professional (CDMP)
• Computer Science degree from Virginia Commonwealth University
• Most recent focus: Understanding emerging data modeling trends and how these can best be leveraged for the Enterprise.

Peter Aiken, Ph.D.

• 30+ years in data management
• Repeated international recognition
• Founder, Data Blueprint (datablueprint.com)
• Associate Professor of IS (vcu.edu)
• DAMA International (dama.org)
• 9 books and dozens of articles
• Experienced w/ 500+ data management practices
• Multi-year immersions:
– US DoD (DISA/Army/Marines/DLA)
– Nokia
– Deutsche Bank
– Wells Fargo
– Walmart
– …
• DAMA International President 2009-2013
• DAMA International Achievement Award 2001 (with Dr. E. F. "Ted" Codd)
• DAMA International Community Award 2005
• Books shown:
– Monetizing Data Management: Unlocking the Value in Your Organization’s Most Important Asset. Peter Aiken with Juanita Billings, foreword by John Bottega
– The Case for the Chief Data Officer: Recasting the C-Suite to Leverage Your Most Valuable Asset. Peter Aiken and Michael Gorman

James “Michael” Lee

• Data Consultant certified in a number of areas, including Data Vault 2.0 Practitioner, Kimball ETL Architecture and Certified Data Management Professional (CDMP).
• Over 7 years of experience with
– designing data quality solutions
– improving data management practices
– implementing Data Governance frameworks
– architecting data warehouses
– implementing system upgrades and migrations
• In the following industries:
– telecommunications
– banking
– insurance
– government (defense)
– commercial manufacturing
– international shipping

We believe ...

Asset types compared: Data Assets, Financial Assets, Real Estate Assets, Inventory Assets

• Non-depletable: Available for subsequent use; Can be used up; Can be used up
• Non-degrading: √; √; Can degrade over time; Can degrade over time
• Durable Non-taxed: √; √
• Strategic Asset: √; √; √; √

• Today, data is the most powerful, yet underutilized and poorly managed organizational asset
• Data is your
– Sole
– Non-depletable
– Non-degrading
– Durable
– Strategic
• Asset
– Data is the new oil!
– Data is the new (s)oil!
– Data is the new bacon!
• Our mission is to unlock business value by
– Strengthening your data management capabilities
– Providing tailored solutions, and
– Building lasting partnerships

Asset: A resource controlled by the organization as a result of past events or transactions and from which future economic benefits are expected to flow [Wikipedia]

Trends in Data Modeling

• Business to Data: the Relationship
• What is a Data Model?
• Conceptual, Logical, Physical
• What issues can poor data modeling introduce?
• Different Models, Different Uses
• Traditional (3NF, Star Schema, Data Vault)
• NoSQL Technologies (Key-Value/Document, Graph, Column Family)
• Trends
- Move to the business
- Self Service and Virtualization
- Agile
- Data Sharing World (The APIs)
- Patterns and Reuse
- Metadata Modeling

What is a Data Model*?

• A data model organizes data elements and standardizes how the data elements relate to one another.

• In “Data Modeling Made Simple”, Steve Hoberman writes: "A data model is a wayfinding tool for both business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment."

*According to ANSI.

Why should we care about poor data models?

• Poor data modeling up front can cause Data Quality issues “downstream”

• If the model isn’t a true representation of the business concepts, this will impact confidence in the data, inhibit business insights and innovation

• Potential for poor DB/Application performance for reads/writes. Example: Over-normalization

• Lack of flexibility can cause difficulty aligning with evolving business requirements

• Difficulty integrating data in the future

• Constrains business agility by complicating reengineering

• Creates operational inefficiencies (ex: poor application performance)

• Limits workflow transparency

• Proliferates system work-arounds, including shadow systems developed by end users

• Impact Analysis


How are Data Models Expressed as Architectures?

• Attributes are organized into entities/objects
– Attributes are characteristics of "things"
– Entities/objects are "things" whose information is managed in support of strategy
• Entities/objects are organized into models
– Combinations of attributes and entities are structured to represent information requirements
– Poorly structured data constrains organizational information delivery capabilities
• Models are organized into architectures
– When building new systems, architectures are used to plan development
– More often, data managers do not know what existing architectures are and, therefore, cannot make use of them in support of strategy implementation

[Figure labels: More Granular, More Abstract]

The Conceptual Data Model

• Represents entities and relationships
• Should identify the domain and scope of data
• Should be easily understood by business users in order to communicate core data concepts, and drive application requirements

Example: We need to model customer address data. A customer may have many addresses, and many customers may share one address. This is a “many-to-many” relationship.
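A minimal sketch of that many-to-many relationship, assuming hypothetical customer names and addresses. The relationship is held as a set of (customer, address) pairs so that it can be read from either side; nothing here comes from the slides beyond the cardinality itself.

```python
# Hypothetical sample data: each pair links one customer to one address.
customer_addresses = {
    ("Alice", "1 Main St"),
    ("Alice", "9 Oak Ave"),
    ("Bob", "1 Main St"),
}


def addresses_of(customer, pairs):
    """Return the sorted addresses linked to one customer."""
    return sorted(address for name, address in pairs if name == customer)


def customers_at(address, pairs):
    """Return the sorted customers that share one address."""
    return sorted(name for name, place in pairs if place == address)
```

Calling addresses_of("Alice", customer_addresses) reads the relationship from the customer side; customers_at("1 Main St", customer_addresses) reads it from the address side.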

DISPOSITION Data Map

Data map of DISPOSITION

• At least one but possibly more system USERS enter the DISPOSITION facts into the system.
• An ADMISSION is associated with one and only one DISCHARGE.
• An ADMISSION is associated with zero or more FACILITIES.
• An ADMISSION is associated with zero or more PROVIDERS.
• An ADMISSION is associated with one or more ENCOUNTERS.
• An ENCOUNTER may be recorded by a system USER.
• An ENCOUNTER may be associated with a PROVIDER.
• An ENCOUNTER may be associated with one or more DIAGNOSES.
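The ADMISSION rules in the data map can be written down as minimum/maximum cardinalities and checked mechanically. The sketch below is one possible encoding, not part of the original material: the RULES table, the record layout (a dict of related-record lists), and the function name are all assumptions made for illustration.

```python
# (parent, child) -> (minimum, maximum); None means "no upper bound".
RULES = {
    ("ADMISSION", "DISCHARGE"): (1, 1),     # one and only one
    ("ADMISSION", "FACILITY"): (0, None),   # zero or more
    ("ADMISSION", "PROVIDER"): (0, None),   # zero or more
    ("ADMISSION", "ENCOUNTER"): (1, None),  # one or more
}


def violations(admission):
    """Return the child entity names whose counts break a cardinality rule.

    `admission` maps a child entity name to the list of related records.
    """
    problems = []
    for (_parent, child), (minimum, maximum) in RULES.items():
        count = len(admission.get(child, []))
        if count < minimum or (maximum is not None and count > maximum):
            problems.append(child)
    return problems
```

An admission with one discharge and one encounter passes; an admission with no discharge, or with two, is reported.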


ADMISSION Contains information about patient admission history related to one or more inpatient episodes

DIAGNOSIS Contains the International Disease Classification (IDC) of code representation and/or description of a patient's health related to an inpatient code

DISCHARGE A table of codes describing disposition types available for an inpatient at a FACILITY

ENCOUNTER Tracking information related to inpatient episodes

FACILITY File containing a list of all facilities in regional health care system

PROVIDER Full name of a member of the FACILITY team providing services to the patient

USER Any user with access to create, read, update, and delete DISPOSITION data

A sample data entity and associated metadata

Entity: BED
Data Asset Type: Principal Data Entity
Purpose: This is a substructure within the Room substructure of the Facility Location. It contains information about beds within rooms.
Source: Maintenance Manual for File and Table Data (Software Version 3.0, Release 3.1)
Attributes: Bed.Description, Bed.Status, Bed.Sex.To.Be.Assigned, Bed.Reserve.Reason
Associations: >0-+ Room
Status: Validated

• A purpose statement describing why the organization is maintaining information about this business concept;
• Sources of information about it;
• A partial list of the attributes or characteristics of the entity; and
• Associations with other data items; this one is read as "One room contains zero or many beds."
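One way to keep this metadata next to the model is a small record type. The sketch below is an illustration only: the class name, field names, and the default status are assumptions, and the BED values are copied from the sample above.

```python
from dataclasses import dataclass, field


@dataclass
class EntityMetadata:
    """Metadata kept alongside one entity (field names are assumed)."""
    name: str
    asset_type: str
    purpose: str
    source: str
    attributes: list = field(default_factory=list)
    associations: list = field(default_factory=list)
    status: str = "Not Validated"


def is_validated(entity):
    """True when the entity's metadata has been marked as validated."""
    return entity.status == "Validated"


bed = EntityMetadata(
    name="BED",
    asset_type="Principal Data Entity",
    purpose="Information about beds within rooms.",
    source="Maintenance Manual for File and Table Data",
    attributes=[
        "Bed.Description",
        "Bed.Status",
        "Bed.Sex.To.Be.Assigned",
        "Bed.Reserve.Reason",
    ],
    associations=["One room contains zero or many beds"],
    status="Validated",
)
```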

The Logical Data Model

• Should represent the Conceptual Data model more thoroughly, but be otherwise very similar

• Will include attributes, names, relationships, and other metadata

• Will be developed using Data Modeling notation (e.g., UML)

The Physical Data Model

• Describes the specific database implementation of the data

• Attributes will be named according to naming conventions

• Displays data types, accurate table names, key information, etc.
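As a sketch of what a physical model adds, the customer/address example can be written as concrete tables with data types and keys. The table names, column names, and the choice of SQLite (through Python's standard sqlite3 module) are assumptions for illustration; a real implementation would follow the target database's own naming conventions.

```python
import sqlite3

# Hypothetical physical model: a junction table resolves the many-to-many.
DDL = """
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL
);
CREATE TABLE address (
    address_id  INTEGER PRIMARY KEY,
    street_line TEXT NOT NULL
);
CREATE TABLE customer_address (
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    address_id  INTEGER NOT NULL REFERENCES address(address_id),
    PRIMARY KEY (customer_id, address_id)
);
"""


def build_physical_model():
    """Create the tables in an in-memory database and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.executescript(DDL)
    return conn


def table_names(conn):
    """List the tables that the physical model created."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]
```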


CM2 Component Evolution is technology derived but technology independent

Data Reengineering for More Shareable Data

[Figure labels: Other logical as-is data architecture components; Data Modeling Framework; Conceptual, Logical, Physical; Goal; Validated; Not Validated]


Normalization Rules Overview

• 1st Normal Form - no repeating non-key attributes for a given primary key
• 2nd Normal Form - no non-key attributes that depend on only a portion of the primary key
• 3rd Normal Form - no attributes depend on something other than the primary key
• 4th Normal Form - attributes depend on not only key but the value of the key
• 5th Normal Form - an entity is in 5NF if its dependencies on occurrences of the same entity or entity type have been moved into a structured entity

The row in every table is dependent on the key, the whole key, and nothing but the key

Third Normal Form

• Each attribute in the relationship is a fact about a key
• Highly normalized structure
• Use Cases:
– Transactional Systems.
– Operational Data Stores.
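A small worked example of moving a table toward third normal form, assuming hypothetical order rows. In the flat rows, customer_city is a fact about the customer rather than about the order key, so it is moved into its own structure keyed by customer.

```python
# Hypothetical flat rows: customer_city depends on customer, not on order_id.
orders_flat = [
    {"order_id": 1, "customer": "Alice", "customer_city": "Richmond"},
    {"order_id": 2, "customer": "Alice", "customer_city": "Richmond"},
    {"order_id": 3, "customer": "Bob", "customer_city": "Norfolk"},
]


def to_third_normal_form(rows):
    """Split flat rows into orders (keyed by order_id) and customers."""
    customers = {}
    orders = []
    for row in rows:
        # Each customer fact is stored once, however many orders reference it.
        customers[row["customer"]] = {"customer_city": row["customer_city"]}
        orders.append({"order_id": row["order_id"], "customer": row["customer"]})
    return orders, customers
```

After the split, a change of city is made in one place instead of once per order, which is the reduced redundancy that normalization is meant to provide.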

Third Normal Form: Pros and Cons

• Pros

– Easily understood by business and end users

– Reduced data redundancy

– Enforced referential integrity

– Indexed attributes/flexible querying

• Cons

– Joins can be expensive

– Does not scale


Neo4j.com

Star Schema

• Consists of “fact tables” that contain quantitative data, and any number of adjoining “dimension” tables
• Optimized for business reporting
• Use Cases:
– OLAP (Online Analytical Processing)
– BI

Wikipedia
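A minimal star schema sketch, assuming a hypothetical sales fact table with product and date dimensions. The table names, columns, and sample rows are invented for illustration, and SQLite (via Python's sqlite3 module) stands in for whatever DBMS would actually be used.

```python
import sqlite3


def build_star():
    """Create one fact table with two adjoining dimension tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER);
        CREATE TABLE fact_sales (
            product_key INTEGER REFERENCES dim_product(product_key),
            date_key    INTEGER REFERENCES dim_date(date_key),
            amount      REAL
        );
        INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
        INSERT INTO dim_date VALUES (10, 2014), (11, 2015);
        INSERT INTO fact_sales VALUES (1, 10, 5.0), (1, 11, 7.0), (2, 11, 20.0);
    """)
    return conn


def sales_by_category(conn, year):
    """Typical reporting query: join the fact to its dimensions, then aggregate."""
    rows = conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        WHERE d.year = ?
        GROUP BY p.category
        ORDER BY p.category
        """,
        (year,),
    )
    return rows.fetchall()
```

The query shape stays the same for any question the dimensions can answer, which is why the design has to anticipate those questions.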

Star Schema Pros and Cons

• Pros

– Simple Design

– Fast Queries

– Most major DBMS are optimized for Star Schema Designs

• Cons

– Questions must be built into the design

– Data marts are often centralized on one fact table


Data Vault

• Designed to facilitate long-term historical storage, focusing on ease of implementation
• Retains data lineage information (source/date)
• “All the data, all the time”. Hybrid approach of Inmon and Kimball.
• Composed of Hubs (which contain a list of business keys that do not change often), Links (associations/transactions between hubs), and Satellites (descriptive attributes associated with hubs and links)
• Use Cases:
– Data Warehousing
– Complete Auditability

Bukhantsov.org
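The hub, link, and satellite roles can be sketched as tables. Everything named below is an assumption for illustration (table names, the `_hk` and `_bk` column suffixes, and the use of SQLite); only the three roles and the source/date lineage columns come from the description above.

```python
import sqlite3

# Hypothetical names; load_date and record_source carry the lineage.
VAULT_DDL = """
CREATE TABLE hub_customer (
    customer_hk   TEXT PRIMARY KEY,      -- surrogate key (assumed)
    customer_bk   TEXT NOT NULL UNIQUE,  -- business key
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE hub_address (
    address_hk    TEXT PRIMARY KEY,
    address_bk    TEXT NOT NULL UNIQUE,
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE link_customer_address (
    link_hk       TEXT PRIMARY KEY,
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    address_hk    TEXT NOT NULL REFERENCES hub_address(address_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (
    customer_hk   TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL,
    full_name     TEXT,
    PRIMARY KEY (customer_hk, load_date)
);
"""


def build_vault():
    """Create the hub, link, and satellite tables in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(VAULT_DDL)
    return conn


def name_history(conn, customer_hk):
    """Return every recorded version of a customer's name, oldest first."""
    rows = conn.execute(
        "SELECT load_date, full_name, record_source "
        "FROM sat_customer_details WHERE customer_hk = ? ORDER BY load_date",
        (customer_hk,),
    )
    return rows.fetchall()
```

Because a satellite row is added rather than updated, each change keeps its own load date and source, which is what makes the history auditable.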

Data Vault Pros and Cons

• Pros

– Simple integration

– Houses immense amounts of data with excellent performance

– Full data lineage captured

• Cons

– Complication is pushed to the “back end”

– Can be difficult to set up for many data workers

– No widespread ETL tool support yet

Model Comparison Matrix

                         3NF   Dimensional   Vault
Scalability              ☑     ☑             ☑
Flexibility              ☒     ☒             ☑
Reengineering            ☒     ☒             ☑
Auditability                                 ☑
Business Interpretable   ☑     ☑             ☒
Presentation Layer       ☒     ☑             ☒
Performance              ☒     ☑             ☑
Support                  ☑     ☑

Gartner Five-phase Hype Cycle

Technology Trigger: A potential technology breakthrough kicks things off. Early proof-of-concept stories and media interest trigger significant publicity. Often no usable products exist and commercial viability is unproven.

Peak of Inflated Expectations: Early publicity produces a number of success stories, often accompanied by scores of failures. Some companies take action; many do not.

Trough of Disillusionment: Interest wanes as experiments and implementations fail to deliver. Producers of the technology shake out or fail. Investments continue only if the surviving providers improve their products to the satisfaction of early adopters.

Slope of Enlightenment: More instances of how the technology can benefit the enterprise start to crystallize and become more widely understood. Second- and third-generation products appear from technology providers. More enterprises fund pilots; conservative companies remain cautious.

Plateau of Productivity: Mainstream adoption starts to take off. Criteria for assessing provider viability are more clearly defined. The technology’s broad market applicability and relevance are clearly paying off.

2012 Hype Cycle

2012 Big Data in Hype Cycle

2013 Big Data in Hype Cycle

2014 Big Data in Hype Cycle

"A focus on big data is not a substitute for the fundamentals of information management."

NoSQL Solutions*

• Document/Key Value
– “Schema-less” design empowers developers*
– Scalable
– High availability
– Economically viable (scale out not up!)
• RDF/Triple Store
– Purpose-built to store triples (“bob likes football”)
– SPARQL is a query language specific to RDF.
– One of the pillars of “Semantic Web”
• Graph
– Structure composed of “nodes”, “edges”, and “properties”
– Focused on the interconnection between entities
– Fast queries to find associative data
• Column Family
– Columns are stored individually (but clustered by “family” unlike traditional columnar databases)
– By only querying specific column families, we can have nearly unlimited numbers of columns without causing expensive queries

*not exhaustive!
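To compare the four shapes side by side, the sketch below stores the same fact (“bob likes football”) in each of them using plain Python structures. These are stand-ins for illustration, not the API of any particular NoSQL product, and every key and helper name is an assumption.

```python
# Document / key-value: one self-describing record looked up by its key.
document = {"_id": "bob", "name": "Bob", "likes": ["football"]}

# RDF / triple store: (subject, predicate, object) statements.
triples = [("bob", "likes", "football")]

# Graph: nodes with properties, plus edges that connect them.
graph = {
    "nodes": {"bob": {"kind": "person"}, "football": {"kind": "sport"}},
    "edges": [("bob", "LIKES", "football")],
}

# Column family: row key -> family -> column -> value.
column_family = {
    "bob": {
        "profile": {"name": "Bob"},
        "interests": {"likes": "football"},
    }
}


def objects_of(subject, predicate, store):
    """Triple lookup: every object stated for a subject and predicate."""
    return [o for s, p, o in store if s == subject and p == predicate]


def neighbours(node, store):
    """Graph lookup: nodes reachable over one outgoing edge."""
    return [target for source, _label, target in store["edges"] if source == node]


def read_family(row_key, family, store):
    """Column-family read: fetch one family without touching the others."""
    return store[row_key][family]
```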

NoSQL Data Models

[Figure panels: RDF/Triple Store; Graph (Source: Neo4J); Document Store (Source: MongoDB); Column Store (Source: Toadworld)]

NoSQL providers

Wikibon.org

Example: Marvel’s Data Model


Move it to the Business

• Models need to add value
• Models need to be part of the process
– (Not a documentation of the process)
• Models need to assist in improving capabilities, not hindering them
– Self Service BI

Self Service and Virtualization

• Self Service BI requires end user understanding of the system

• Presentation Data Models


Agile

• Incremental build of models

– Not an excuse to create bad models

• 80/20 Rule

• The problem with code first

– Rules exist in code

– Reengineering concerns

– Governance concerns

– Lacks business insights

• Database First

– Creates value in modeling

– Enforced integrity and lineage of the data

– Integrates the model into the process

– Used to generate code
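As a sketch of the model being “used to generate code”, the function below turns a small model description into Python dataclass source text. The MODEL dictionary, its layout, and the function name are hypothetical; real modeling tools generate code in their own ways.

```python
# Hypothetical model description: entity name -> {attribute name: type name}.
MODEL = {
    "Customer": {"customer_id": "int", "full_name": "str"},
}


def generate_dataclass(entity, attributes):
    """Return Python source for a dataclass that mirrors one modeled entity."""
    lines = ["@dataclass", f"class {entity}:"]
    lines += [f"    {name}: {type_name}" for name, type_name in attributes.items()]
    return "\n".join(lines)
```

The generated text still needs `from dataclasses import dataclass` in the module that receives it. Because the class is derived from the model, a change to the model flows into the code rather than the other way around.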


A Data Sharing World

• Adding structure to information allows us to obtain exactly what we want, when we want it.

• Allows applications to serve up data to external sources in a structured way: “post-schema”.
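A minimal sketch of serving structured data to an outside consumer, assuming a hypothetical record and field list. The caller names the fields it wants and receives exactly those, serialized as JSON with Python's standard json module.

```python
import json


def to_api_payload(record, fields):
    """Serialize only the requested fields of a record as a JSON string."""
    selected = {name: record[name] for name in fields}
    return json.dumps(selected, sort_keys=True)
```

For example, to_api_payload({"id": 1, "name": "Bob", "ssn": "x"}, ["id", "name"]) leaves the ssn field out of the response.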


Design Patterns

• Why are the restrooms generally in the same place in each building?
• What about the electrical wiring?
• HVAC? Floorplans? ...
• Architecture design patterns (spoke and hub, hub of hubs, warehouse, cloud, MDM, changing tires, portal)

Patterns and Reuse

• Common rule of thumb:
– One third of a data model contains fields common to all businesses,
– One third contains fields common to the industry, and
– The other third is specific to the organization.

• Patterns should theoretically provide an organization with a base-line to quickly develop data infrastructure.

• Off-the-shelf solutions may require in-depth customization or specialization.

Source: http://dmreview.com/article_sub.cfm?articleID=1000941 used with permission

Meta Data Models

Marco & Jennings's Metadata Model
Source: http://dmreview.com/article_sub.cfm?articleID=1000941 used with permission


Conclusions

• Data Modeling is important to get right.
• Getting it “right” is hugely dependent on the business case, maturity of the organization, flexibility for future growth, and so much more.
• There are many technologies and ideas available to help solve a number of problems.
• Don't try any of this without considering the various architectures involved

Questions?

It’s your turn!

Use the chat feature or Twitter (#dataed) to submit your questions to Peter, Michael and Steven now.

Upcoming Events

Data Quality Success Stories
September 8, 2015 @ 2:00 PM ET/11:00 AM PT

Design & Manage Data Structures
October 13, 2015 @ 2:00 PM ET/11:00 AM PT

Sign up here:
• www.datablueprint.com/webinar-schedule
• or www.dataversity.net

Sources

• Data model. (2014, October 7). In Wikipedia, The Free Encyclopedia. Retrieved October 7, 2014, from http://en.wikipedia.org/w/index.php?title=Data_model&oldid=628639882
• Data Modeling 101. (2006). In Agile Data. Retrieved October 7, 2014, from http://www.agiledata.org/essays/dataModeling101.html