healthcare best practices in data warehousing & analytics

Data Warehousing A Look Back, Moving Forward Dale Sanders June 2005

Upload: dale-sanders

Post on 23-Feb-2017

Page 1: Healthcare Best Practices in Data Warehousing & Analytics

Data Warehousing

A Look Back, Moving Forward

Dale Sanders, June 2005

Page 2: Healthcare Best Practices in Data Warehousing & Analytics

Introduction & Warnings Why am I here?

Teach Stimulate some thought Share some of my experiences and lessons

Learn From you, please… Ask questions, challenge opinions, share your knowledge

I’ll do my best to live up to my end of the bargain

Warnings The pictures in this presentation May or may not have any relevance whatsoever to the topic or slide Mostly intended to break up the monotony

Page 3: Healthcare Best Practices in Data Warehousing & Analytics

Expectation Management DW Strengths (according to others)

I know what not to do as much as I know what to do Seen and made all the big mistakes

Vision, strategy, system architecture, data management & DW modeling, complex cultural issues, “leapfrog” problem solving

What not to expect: DW weaknesses My programming skills suck

Haven’t written a decent line of code in four years! Some might say it’s been 24 years…

Knowledge of leading products is very rusty Though I’m beefing up on Microsoft and Cognos

Within these expectations, make no mistake about it… I know data warehousing

Page 4: Healthcare Best Practices in Data Warehousing & Analytics

Today’s Discussions

I am a good “Idea Guy” But ideas are worthless without someone to implement and enhance them

Steve Barlow, Dan Lidgard, Jon Despain, Chuck Lyon, Laure Shull, Kris Mitchell, Peter Hess, Ron Gault, Rob Carpenter, my wife, and many others

My greatest strength and blessing The ability to recognize, listen to, and hold onto good people Knock on wood

My achievements in personal and professional life More a function of those around me than a reflection on me

Page 5: Healthcare Best Practices in Data Warehousing & Analytics

DW Best Practices: The Most Important Metrics

Employee satisfaction Without it, long-term customer satisfaction is impossible

Customer satisfaction That’s the nature of the Information Services career field Some people in our profession still don’t get it

We are here to serve

The Organizational Laugh Metric How many times do you hear laughter in the day-to-day operations of your team? It is the single most important vital sign of organizational health and business success

Page 6: Healthcare Best Practices in Data Warehousing & Analytics

My Background Three, eight-year chapters

Captain, Information Systems Engineer, US Air Force: Nuclear warfare battle management; Force status data integration; Intelligence and attack warning data “fusion”

Consultant in several industries

TRW: CIA Data Center; TRW Credit Reporting Data Base

National Security Agency (NSA)

Intel: New Mexico Data Repository (NMDR)

Air Force: Integrated Minuteman Data Base (IMDB); Peacekeeper Information Retrieval System (PIRS)

Many others…

Healthcare: Intermountain Health Care Enterprise Data Warehouse; Consultant to other healthcare organizations’ data warehouses; Now at Northwestern University Medical System

Page 7: Healthcare Best Practices in Data Warehousing & Analytics

Overview

Data warehousing history According to Sanders Why and how did this become a sub-specialty in information systems? What have we learned so far?

My take on “Best Practices” Key lessons-learned

My thoughts on the most popular authors in the field What they contribute, where they detract

Page 8: Healthcare Best Practices in Data Warehousing & Analytics

Data Warehousing History

“Newspaper Rock”, 100 B.C.

American Retail, 2005 A.D.

Lots of stuff happened

Page 9: Healthcare Best Practices in Data Warehousing & Analytics

What Happened in the Cloud? Stage 1: Laziness

Operators grew tired of hanging tapes In response to requests for historical financial data

They stored data on-line, in “unauthorized” mainframe databases

Stage 2: End of the mainframe bully Computing moved out from finance to the rest of the business Unix and relational databases Distributed computing created islands of information

Stage 2.1: The government gets involved Consolidating IRS and military databases to save money on mainframes “Hey, look what I can do with this data…”

Stage 3: Deming comes along Push towards constant business “reengineering” Cultural emphasis on “continuous quality improvement” and “business innovation” drives the need for data

Stage 4: Data warehousing has its own language Ralph Kimball publishes “The Data Warehouse Toolkit”

Page 10: Healthcare Best Practices in Data Warehousing & Analytics

The Real Truth Data warehousing is a symptom of a problem

Technological inability to deploy single-platform information systems that:

Capture data once and reuse it throughout an enterprise

Support high transaction rates (single-record CREATE, SELECT, UPDATE, DELETE) and analytic queries on the same computing platform, with the same data, at the same time

Someday, maybe we will address the root cause Until then, it’s a good way to make a living

Page 11: Healthcare Best Practices in Data Warehousing & Analytics

The “Ideal Library” Practice Stores all of the books and other reference material you need to conduct your research The Enterprise data warehouse

A single place to visit One database environment

Contents are kept current and refreshed Timely, well choreographed data loads

Staffed with friendly, knowledgeable people that can help you find your way around Your Data Warehouse team

Organized for easy navigation and use Metadata Data models “User friendly” naming conventions

Page 13: Healthcare Best Practices in Data Warehousing & Analytics

Business Culture

Does your CEO… Talk about constant improvement, constantly? Drive corporate goals that are SMART?

Specific, Measurable, Attainable, Realistic, Tangible

Crave data to make better informed decisions? Become visibly, buoyantly excited at a demo for a data cube?

If so, the success of your data warehouse is right around the corner… sort of…

I love data!

Page 14: Healthcare Best Practices in Data Warehousing & Analytics

Political Best Practices

You will be called a “data thief” Get used to it Encourage life cycle ownership of the OLTP data, even in the EDW

You will be called “dangerous” “You don’t understand our data!” OLTP owners know their data better than you do; acknowledge it and leverage it

You will be blamed for poor data quality in the OLTP systems This is a natural reaction Data warehouses raise the visibility of poor data quality Use the EDW as a tool for raising overall data quality

You will be called a “job robber” EDW is perceived as a replacement for OLTP systems Educate people: The EDW depends on OLTP systems for its existence

Stick to your values and pure motives The politics will fade away

Page 15: Healthcare Best Practices in Data Warehousing & Analytics

Data Quality Pitfall

Taking accountability for data quality on the source system Spending gobs of time and money “cleansing” data before it’s loaded into the DW It’s a never ending, never win battle You will always be one step behind data quality You will always be in the cross-hairs of blame

Best Practice Push accountability where it belongs: to the source system Use the data warehouse as a tool to reveal data quality, either good or bad Be prepared to weather the initial storm of blame

Page 16: Healthcare Best Practices in Data Warehousing & Analytics

Measuring Data Quality Data Quality = Completeness x Validity

Can it be measured objectively?

Measuring “Completeness” Number of null values in a column

Measuring “Validity” Cardinality is a simple way to measure validity

“We only have four standard regions in the business, but we have 18 distinct values in the region column.”
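A minimal sketch of the two measures above, assuming a column is available as a Python list with None standing in for null. The function names are hypothetical, and scoring validity as the share of non-null values found in an approved list is only one way to turn the cardinality check into a number; the slides do not define it further.

```python
from typing import Iterable, Optional, Set


def completeness(values: Iterable[Optional[str]]) -> float:
    """Fraction of values in the column that are not null."""
    values = list(values)
    if not values:
        return 0.0
    return sum(v is not None for v in values) / len(values)


def validity(values: Iterable[Optional[str]], allowed: Set[str]) -> float:
    """Fraction of non-null values that belong to the approved set (assumed scoring)."""
    present = [v for v in values if v is not None]
    if not present:
        return 0.0
    return sum(v in allowed for v in present) / len(present)


def distinct_count(values: Iterable[Optional[str]]) -> int:
    """Cardinality check: number of distinct non-null values in the column."""
    return len({v for v in values if v is not None})


def data_quality(values: Iterable[Optional[str]], allowed: Set[str]) -> float:
    """Data Quality = Completeness x Validity, as stated on the slide."""
    values = list(values)
    return completeness(values) * validity(values, allowed)
```

A distinct count well above the size of the approved list, such as 18 values against four standard regions, is the signal the slide describes.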

Page 17: Healthcare Best Practices in Data Warehousing & Analytics

Business Validity How can you measure it? You can’t…

“I collect this data from our customers, but I have to guess sometimes because I don’t speak Spanish.”

“This data is valid for trend analysis decisions before 9/11/2001, but should not be used after that date, due to changes in security procedures.”

“You can’t use insurance billing and reimbursement data to make clinical, patient care decisions.”

“This customer purchased four copies of ‘Zamfir, Master of the Pan Flute’, therefore he loves everything about Zamfir.” What Amazon didn’t know: I bought them for my mom and her

sewing circle.

Where do you capture subjective data quality? Metadata….

Page 18: Healthcare Best Practices in Data Warehousing & Analytics

The Importance of Metadata

Maybe the most over-hyped, underserved area of data warehousing common sense Vendors want to charge you big $$$$$ for their tools Consultants would like you to think that it’s the Holy Grail in disguise and only they can help you find it Authors who have never been in an operational environment would have you chasing your tail in pursuit of an esoteric, mythological Metadata Nirvana

Don’t listen to the confusing messages! You know the answer… just listen to your common sense…

Page 19: Healthcare Best Practices in Data Warehousing & Analytics

Metadata: Keep It Simple! Ultimately, what are the most valuable business motives behind metadata?

Make data more “understandable” to those who are not familiar with it Data quality issues Data timeliness and temporal issues Context in which it was collected Translating physical names to natural language

Make data more “findable” to those who don’t know where it is Organize it

Take a lesson from library science and the card catalog

Page 20: Healthcare Best Practices in Data Warehousing & Analytics

Table Elements

Required Elements

Long Name (or English name) Description

Semi-optional Elements Source Example Data Steward

Page 21: Healthcare Best Practices in Data Warehousing & Analytics

Column Elements

Required Elements

Long Name Description

Optional Elements Value Range Data Quality Associated Lookup

Page 22: Healthcare Best Practices in Data Warehousing & Analytics

The Data Model

TABLE_ENT
TABLE_ENT_ID: NUMBER
TABLE_ENT_DESC: VARCHAR2(4000), TABLE_ENT_SRC: VARCHAR2(50), TABLE_ENT_NAME: VARCHAR2(50), TABLE_TYPE: VARCHAR2(10), CREATE_DT: DATE, LAST_LOAD_DT: DATE, SCHEMA_ID: NUMBER

DATA_MART
DATA_MART_ID: NUMBER
DATA_MART_NAME: VARCHAR2(50), DATA_MART_DESC: VARCHAR2(4000), DATA_STEWARD: VARCHAR2(50), LAST_LOAD_DT: DATE, UPDATE_FREQ: VARCHAR2(50), DATA_BEG_DT: DATE, DATA_END_DT: DATE

DATA_MART_TABLE_ENT
DATA_MART_ID: NUMBER, TABLE_ENT_ID: NUMBER

FOLDER
FOLDER_ID: NUMBER
PARENT_FOLDER_ID: NUMBER, FOLDER_NM: VARCHAR2(50), FOLDER_DSC: VARCHAR2(4000), CREATE_USER_ID: VARCHAR2(20), CREATE_DT: DATE

REPORT
RPT_ID: NUMBER
FOLDER_ID: NUMBER, RPT_NM: VARCHAR2(250), RPT_LOC_TXT: VARCHAR2(1000), PURPOSE_TXT: VARCHAR2(4000), RUN_FREQ_TXT: VARCHAR2(1000), AUDIENCE_TXT: VARCHAR2(500), EDW_RPT_FLG: NUMBER, DATA_SOURCE_TXT: VARCHAR2(4000), SELECT_CRITERIA_TXT: VARCHAR2(4000), STAT_METHODS_TXT: VARCHAR2(4000), RPT_TOOL_TXT: VARCHAR2(250), CODE_TXT: CLOB, FORMULA_TXT: CLOB, COMMENTARY_TXT: VARCHAR2(4000), AUTHOR_NM: VARCHAR2(500), AUTHOR_TITLE_TXT: VARCHAR2(500), AUTHOR_DEPT_TXT: VARCHAR2(500), AUTHOR_LOC_TXT: VARCHAR2(500), AUTHOR_PHONE_TXT: VARCHAR2(500), AUTHOR_EMAIL_TXT: VARCHAR2(500), BUSINESS_OWNER_TXT: VARCHAR2(500), METADATA_UPDATE_DT: DATE, VALIDATION_DT: DATE, CREATE_USER_ID: VARCHAR2(20), CREATE_DT: DATE

REPORT_TABLE_ENT_ASSOC
RPT_ID: NUMBER, TABLE_ENT_ID: NUMBER

ATTRIBUTE
ATTRIBUTE_ID: NUMBER
TABLE_ENT_ID: NUMBER, ATTRIBUTE_DESC: VARCHAR2(4000), ATTRIBUTE_NAME: VARCHAR2(50), ATTRIBUTE_DATATYPE: VARCHAR2(50), SAMPLE_VALUE: VARCHAR2(100), INDEX_FLG: NUMBER, PRIMARY_KEY_FLG: NUMBER, TABLE_POSITION_NO: NUMBER

SCHEMA
SCHEMA_ID: NUMBER
SCHEMA_DESC: VARCHAR2(50)

Page 23: Healthcare Best Practices in Data Warehousing & Analytics

Example Metadata Entry

LKUP.POSTAL_CD_MASTER Table

Long Name: Postal Code Master - IHC

Description: Contains Postal (Zip) codes for the IHC referral region and IHC specific descriptions. These descriptions allow for specific IHC groupings used in various analyses.

Data Steward: Jim Allred, ext. 3518

Page 24: Healthcare Best Practices in Data Warehousing & Analytics

Metadata on the Web

Page 25: Healthcare Best Practices in Data Warehousing & Analytics

Some Info Is Free It can be collected from the database. For example:

Primary and Foreign Keys Indexed Columns Table Creation Dates
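As a rough illustration of "free" metadata, the sketch below reads keys and indexed columns straight from a database catalog. It uses Python's built-in sqlite3 module as a stand-in; the IHC warehouse ran on Oracle, where the same facts would come from data dictionary views such as ALL_CONSTRAINTS and ALL_IND_COLUMNS. The function name `free_metadata` is hypothetical.

```python
import sqlite3


def free_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Collect primary key, foreign keys, and indexed columns from the catalog."""
    # PRAGMA arguments cannot be bound as parameters; `table` is assumed trusted.
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # table_info rows are (cid, name, type, notnull, dflt_value, pk)
    primary_key = [row[1] for row in cols if row[5]]

    # foreign_key_list rows are (id, seq, table, from, to, ...)
    foreign_keys = [
        (row[3], row[2], row[4])  # (local column, referenced table, referenced column)
        for row in conn.execute(f"PRAGMA foreign_key_list({table})")
    ]

    indexed = set()
    # index_list rows are (seq, name, unique, origin, partial)
    for index_row in conn.execute(f"PRAGMA index_list({table})").fetchall():
        # index_info rows are (seqno, cid, column name)
        for info in conn.execute(f"PRAGMA index_info({index_row[1]})"):
            if info[2] is not None:
                indexed.add(info[2])

    return {
        "primary_key": primary_key,
        "foreign_keys": foreign_keys,
        "indexed_columns": sorted(indexed),
    }
```

SQLite does not record table creation dates, so that item from the slide is left out of the sketch.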

Page 26: Healthcare Best Practices in Data Warehousing & Analytics

Most Valuable Info is Subjective

The human element Most metadata is not automatically collected by tools because it does NOT exist in that form

Interviews with data stewards are the key

It can take months (and months and months) of effort to collect initial metadata.

Page 27: Healthcare Best Practices in Data Warehousing & Analytics

Holding Feet to the Fire Made data architects responsible for metadata in their subject areas

Metadata completion reports in every staff meeting for a year

Standing rule: No new data marts go live without metadata
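One way to picture the completion report and the go-live rule is the sketch below. It assumes each table's metadata is held as a small record with the two required table elements named earlier (Long Name and Description); the class and function names are hypothetical, not part of the IHC tooling.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableMetadata:
    name: str
    long_name: Optional[str] = None
    description: Optional[str] = None


def is_complete(table: TableMetadata) -> bool:
    """A table counts as documented when both required elements are filled in."""
    return bool(table.long_name and table.long_name.strip()) and bool(
        table.description and table.description.strip()
    )


def completion_rate(tables: List[TableMetadata]) -> float:
    """Share of tables with complete required metadata, for the staff-meeting report."""
    if not tables:
        return 0.0
    return sum(is_complete(t) for t in tables) / len(tables)


def can_go_live(tables: List[TableMetadata]) -> bool:
    """Standing rule: a data mart goes live only when every table is documented."""
    return bool(tables) and all(is_complete(t) for t in tables)
```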

Page 28: Healthcare Best Practices in Data Warehousing & Analytics

Is it all worth it?

Data analysts think so.

“I couldn’t do my job without it.”

It will push the ROI of a ho-hum DW into the stratosphere

It does for DW’ing what the Yellow Pages did for the business ROI of the telephone

Page 29: Healthcare Best Practices in Data Warehousing & Analytics

It Gets Used At Intermountain Health Care

210 web hits on average each week day (23,000 employees, $2B revenue)

[Chart: Avg Hits by Day of Week (April 2004 - Sep 2004). Bars for MON, TUE, WED, THU, FRI on a 0 to 300 axis; data labels as extracted: 189, 217, 212, 240, 188]

Page 30: Healthcare Best Practices in Data Warehousing & Analytics

“What’s New”

Page 31: Healthcare Best Practices in Data Warehousing & Analytics

Report Quality A function of…

Data quality

How well does the report reflect the intent behind the question being asked? “This report doesn’t make sense. I’m trying to find out how many widgets we can produce next year, based on the last four years’ production.” “That’s not what you asked for.”

SQL and other programming accuracy

Statistical validity: population size of the data

Timeliness of the data relative to the decision

Event Correlation

Best Practice: An accompanying “meta-report” for every report that involves significant, high risk decisions

Page 32: Healthcare Best Practices in Data Warehousing & Analytics

Meta Report

A document, associated with a published report, which defines the report.

Page 33: Healthcare Best Practices in Data Warehousing & Analytics

Repository

A central place for storing and sharing information about business reports

Page 34: Healthcare Best Practices in Data Warehousing & Analytics

IHC Analyst Use of Meta Reports

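[Chart: percentage of IHC analysts (0% to 100% axis) by type of meta report use. Categories: Read Others, Search Duplication, Search SQL, Audience Request. Data labels as extracted: 37%, 89%, 21%, 95%. Data Collected Aug-04, N=32]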

Page 35: Healthcare Best Practices in Data Warehousing & Analytics

Meta Report

Core Elements Author Information Report Name Report Purpose Data Source(s) Report Methods

Recommended Elements Business Owner Run Frequency Intended Audience Statistical Tests Software Used Source Code Formulas Relevant Issues & Commentary

Page 36: Healthcare Best Practices in Data Warehousing & Analytics

• Title

• Location

• Author

• Owner

Page 37: Healthcare Best Practices in Data Warehousing & Analytics

• Purpose

• Frequency

• Audience

• Data Source(s)

Page 38: Healthcare Best Practices in Data Warehousing & Analytics

• Selection Criteria

• Statistics

• Software

• Source Code

• Formulas

Page 39: Healthcare Best Practices in Data Warehousing & Analytics

What’s It Look Like?

Page 41: Healthcare Best Practices in Data Warehousing & Analytics

Utilization and Creation Rate

Page 42: Healthcare Best Practices in Data Warehousing & Analytics

Operations Best Practices

Think: Mission Control Customized ETL Library Schedule of operations Alerting tool Storage strategies / backups Development philosophy and environment Performance: monitoring and tuning

Page 43: Healthcare Best Practices in Data Warehousing & Analytics

IHC Architecture

EDW Oracle v 9.2.0.3 on AIX 5.2 Storage: IBM SAN (shark), >3T

ETL tools Ascential’s DataStage Korn shell (Unix), SQL scripts, PL/SQL scripting

OLAP: MS’ Analysis Services

BI: Business Objects (Crystal Enterprise) With a Cube presentation layer

Dashboard: Visual Mining’s Net Charts

EDW Team: ~16 FTEs, plus SAs and DBAs

Page 44: Healthcare Best Practices in Data Warehousing & Analytics

Customized ETL Library

Page 45: Healthcare Best Practices in Data Warehousing & Analytics

History

One of our ETL programmers noticed he kept doing the same things over and over for all of his ETL jobs. Rather than copying and pasting this repetitive code, he created a library. Now we all use the ETL Library.

We named the library EDW_UTIL (EDW Utilities)

Page 46: Healthcare Best Practices in Data Warehousing & Analytics

Implementation Executes via Oracle stored procedures Supported by associated tables to hold data when necessary Error table QA table Index table

Page 47: Healthcare Best Practices in Data Warehousing & Analytics

Benefits Provides standardization Eliminates code rewrites Can hide complexities Such as the appropriate way to analyze and gather statistics on tables Very accessible to all of our ETL tools Simply an Oracle stored procedure call

Page 48: Healthcare Best Practices in Data Warehousing & Analytics

Index Management Past process included:

Dropping the table’s indexes with a script Loading the table Creating the indexes with a script

The past process resulted in messy scripts to manage and coordinate

Page 49: Healthcare Best Practices in Data Warehousing & Analytics

Index Management New process includes:

Capturing a table’s existing indexes metadata Dropping the table’s indexes with a single procedure call Loading the table Recreating the indexes with a single procedure call

There are no more messy scripts to manage and coordinate No indexes are “lost” because someone neglected to add them to the create-index script

Page 50: Healthcare Best Practices in Data Warehousing & Analytics

Index Management Samples

IMPORT_SCHEMA_INDEX_DATA IMPORT_TABLE_INDEX_DATA DROP_TABLE_INDEXES CREATE_TABLE_INDEXES
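The capture, drop, load, recreate cycle behind these procedures can be sketched as follows. This is not the EDW_UTIL code, which the slides do not show; it is a small imitation using Python's sqlite3 module, where index definitions are read from the `sqlite_master` catalog. The Python function names only loosely echo the procedure names above and are assumptions.

```python
import sqlite3
from typing import List, Sequence, Tuple


def capture_table_indexes(conn: sqlite3.Connection, table: str) -> List[Tuple[str, str]]:
    """Save (index name, CREATE INDEX statement) for every explicit index on the table."""
    return conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = ? AND sql IS NOT NULL",
        (table,),
    ).fetchall()


def drop_table_indexes(conn: sqlite3.Connection, indexes: List[Tuple[str, str]]) -> None:
    for name, _sql in indexes:
        conn.execute(f'DROP INDEX "{name}"')


def create_table_indexes(conn: sqlite3.Connection, indexes: List[Tuple[str, str]]) -> None:
    for _name, sql in indexes:
        conn.execute(sql)  # replay the captured definition unchanged


def reload_table(conn: sqlite3.Connection, table: str, rows: Sequence[tuple]) -> List[str]:
    """Load rows with indexes dropped, then rebuild them from the captured metadata."""
    saved = capture_table_indexes(conn, table)
    drop_table_indexes(conn, saved)
    if rows:
        placeholders = ", ".join("?" * len(rows[0]))
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    create_table_indexes(conn, saved)
    conn.commit()
    return [name for name, _sql in saved]
```

Because the index definitions are captured at load time, an index added to the table later is picked up automatically, which is the point the slide makes about "lost" indexes.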

Page 51: Healthcare Best Practices in Data Warehousing & Analytics

Background Loading of Tables We often load data into tables which are not accessible to end users. A simple rename puts them into production.

Helps transfer the identical attributes from the live to the background table

Samples COPY_TABLE_METADATA TRANSFER_TABLE_PRIVS DROP_TABLE_INDEXES CREATE_TABLE_INDEXES

(Create on background table, identical to production table)
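The rename step can be sketched with Python's sqlite3 module as below. The function name and the `_old` suffix for the retired table are assumptions for illustration; copying privileges and table attributes (the job of the TRANSFER_TABLE_PRIVS and COPY_TABLE_METADATA samples) is not reproduced here.

```python
import sqlite3


def swap_into_production(conn: sqlite3.Connection, live: str, background: str) -> None:
    """Put a fully loaded background table into production with two renames."""
    retired = f"{live}_old"  # hypothetical naming convention for the previous copy
    conn.execute(f"DROP TABLE IF EXISTS {retired}")
    conn.execute(f"ALTER TABLE {live} RENAME TO {retired}")
    conn.execute(f"ALTER TABLE {background} RENAME TO {live}")
    conn.commit()
```

End users keep querying the live table during the load and see the new data only after the swap.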

Page 52: Healthcare Best Practices in Data Warehousing & Analytics

Load Times, Errors, QA We had no idea who was loading what and when

Each staff member logged in their own way and for their own interest

ETL error capturing and QA was difficult We can now capture errors and QA information in a somewhat standardized fashion

Page 53: Healthcare Best Practices in Data Warehousing & Analytics

Load Times, Errors, QA

Samples

BEGIN_JOB_TIME (ex: CASEMIX)

BEGIN_LOAD_TIME (ex: CASEMIX INDEX)

END_LOAD_TIME

END_JOB_TIME

COMPLETE_LOAD_TIME (Begin and end together)

LOAD_TIME_ERROR (Alert on these errors)

LOAD_TIME_METRICS QA (row counts)
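To illustrate the idea of one standard way to log loads, here is a small Python sketch in which a context manager plays the role of the begin and end calls. The in-memory list stands in for the load-times table, and every name in it (`load_time`, `LOAD_LOG`, the record fields) is an assumption; the real library was a set of Oracle stored procedures.

```python
import time
from contextlib import contextmanager

LOAD_LOG = []  # stand-in for the load-times table


@contextmanager
def load_time(job, table, log=LOAD_LOG):
    """Record begin time, end time, row count, and any error for one table load."""
    entry = {"job": job, "table": table, "begin": time.time(),
             "end": None, "rows": None, "error": None}
    log.append(entry)
    try:
        yield entry  # the load code can set entry["rows"] for QA row counts
    except Exception as exc:
        entry["error"] = repr(exc)  # kept so an alert can pick it up later
        raise
    finally:
        entry["end"] = time.time()
```

Since every load passes through the same wrapper, "who loaded what and when" becomes a query against one log instead of a search through personal scripts.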

Page 54: Healthcare Best Practices in Data Warehousing & Analytics

Miscellaneous Procedures

Hide the “gory” details from the majority of the EDW team Such as Oracle’s table analyze command

Gives us consistent application of system wide parameters such as: A new box with a different number of CPUs (parallel slaves) or A new version of Oracle

We populate some metadata too, such as last load date

Page 55: Healthcare Best Practices in Data Warehousing & Analytics

DW Schedule of Operations

Some loads are ad hoc, not scheduled

Users query in an ad hoc fashion

We have a minimal service/application tier implemented (loss of control)

Use of a variety of ETL tools

Use of a variety of user categories DBA, SA, ETL user, end users

Use of a variety of servers Production EDW, Stage EDW, ETL servers, OLAP servers, Presentation layer servers

Page 56: Healthcare Best Practices in Data Warehousing & Analytics

General Approach Focus on load jobs against production EDW

Still working on all the reporting aspects (a sample on the next slide)

Pull this information out of the “load times” data captured by these ETL library calls BEGIN_JOB_TIME BEGIN_LOAD_TIME END_LOAD_TIME END_JOB_TIME COMPLETE_LOAD_TIME

Page 58: Healthcare Best Practices in Data Warehousing & Analytics

DW Alerting Tool DW alerting

Aggregate data alerts, such as, your average length of stay just crossed a certain threshold

A simple tool was created which sends a text email, based on existence of data returned from a query

Primarily embraced by DW team members for internal DW operations, although the original intent has not been abandoned

Page 59: Healthcare Best Practices in Data Warehousing & Analytics

Features Web based Open to all EDW users

Run daily, weekly, every two weeks, monthly, quarterly (wakes every 5 minutes) This is passive polling

Ability to enter query in SQL

Alert (email) on 3 situations Query returns data Query returns no data Always
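The three alert situations reduce to a small decision on whether the query returned any rows. The sketch below shows that decision only, using Python's sqlite3 module as a stand-in database; the constant and function names are hypothetical, and sending the email is left out.

```python
import sqlite3

RETURNS_DATA = "returns_data"
RETURNS_NO_DATA = "returns_no_data"
ALWAYS = "always"


def should_alert(conn: sqlite3.Connection, query: str, condition: str) -> bool:
    """Run the user's SQL and decide whether an alert email is due."""
    has_rows = conn.execute(query).fetchone() is not None
    if condition == ALWAYS:
        return True
    if condition == RETURNS_DATA:
        return has_rows
    if condition == RETURNS_NO_DATA:
        return not has_rows
    raise ValueError(f"unknown alert condition: {condition}")
```

A scheduler that wakes every few minutes would call this for each alert whose run frequency has come due.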

Page 61: Healthcare Best Practices in Data Warehousing & Analytics

Examples ~100 alerts in use Live performance check

Every 4 hours—look for inactive sessions holding active slaves

Daily—look for any active sessions older than 72 hours ETL monitoring; alert only if problem

Alert on errors logged via the EDW_UTIL library (manage by exception)

Alert on existence of “bad” records captured during ETL

Page 62: Healthcare Best Practices in Data Warehousing & Analytics

Storage and Backup Inherited state of affairs Running like any OLTP database

High end expensive SANs (storage area networks)

FULL nightly online backups Out of space? Just buy more

Page 63: Healthcare Best Practices in Data Warehousing & Analytics

Nightmare in the Making Exponential growth

More data sources More summary tables More indexes No data has yet been purged

Relaxed attitude Disk is cheap Reality: Disk management is expensive

Page 64: Healthcare Best Practices in Data Warehousing & Analytics

Looming Crisis Backups often run 16 hours or more

Performance degradation witnessed by users Good backups obtained less than 50% of the time

Literally running out of space Gross underestimating Some reckless overuse

Financial $$$$ cost The system administrators (SAs) quadruple the price of disk purchase from the previous budget year. Ouch! SAs roll in the price of tape drives, etc.

Page 65: Healthcare Best Practices in Data Warehousing & Analytics

Major Changes in Operations

Transfer some disk ownership AND backup responsibilities to DW team, away from SAs and DBAs

EDW team more aware of upcoming space demands

EDW team more in tune with which data sets are easily recreated from the source (don’t need a backup)

Stop performing full daily backups

Move towards less expensive disk option IBM offers a few levels of SANs

Page 66: Healthcare Best Practices in Data Warehousing & Analytics

Tracking and Predicting Storage Use

Page 67: Healthcare Best Practices in Data Warehousing & Analytics

Changes to Backup Strategy

Perform full backup once monthly during downtime

Perform no data backup on DEV/STAGE environments

Do backup DDL (all code) daily in all environments

Implement daily “incremental” backup

Page 68: Healthcare Best Practices in Data Warehousing & Analytics

Daily Incremental Backups Easier said than done

We’ve resorted to a table level backup (in Oracle, that’s an EXPORT)

The EDW team owns which tables are exported EDW team populates a table, the “export table list”, with each table’s export frequency Populated via an application in development

The DBAs run an export based on the “export table list”
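A sketch of how an "export table list" might drive the daily run: each entry carries a table name, an export frequency, and the date of its last export, and the job picks out whatever is due. The frequency labels, their day counts, and the function name are assumptions; the slides do not describe the list's actual layout.

```python
from datetime import date
from typing import List, Optional, Tuple

# Assumed frequency labels and intervals, for illustration only.
EXPORT_INTERVAL_DAYS = {"daily": 1, "weekly": 7, "monthly": 30}

ExportEntry = Tuple[str, str, Optional[date]]  # (table, frequency, last export date)


def tables_due(export_list: List[ExportEntry], today: date) -> List[str]:
    """Return the tables whose export interval has elapsed, or that were never exported."""
    due = []
    for table, frequency, last_export in export_list:
        interval = EXPORT_INTERVAL_DAYS[frequency]
        if last_export is None or (today - last_export).days >= interval:
            due.append(table)
    return due
```

Tables that are easily recreated from the source can simply be left off the list.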

Page 69: Healthcare Best Practices in Data Warehousing & Analytics

Use Cheaper Disk General practice: You can take greater risks with DW reliability and availability vs. OLTP systems Use it to your advantage

Our SAN vendor (IBM) offers a few levels of SANs. Next level down is a big step down in price, small step down in features.

Feature loss: Read cache (referring to disk cache, not box memory).

We rarely read the same thing twice anyway No “phone home” to IBM (auto paging) Mean time to failure is shorter, but still acceptable

Page 70: Healthcare Best Practices in Data Warehousing & Analytics

Performance Monitoring & Tuning Err on the side of freedom and empowerment

How much harm can really be done? We’d rather not constrain our customers

“Pounding queries” do find their way to production Opportunity to educate users Opportunity for us to tune underlying structures

Page 71: Healthcare Best Practices in Data Warehousing & Analytics

The Focus Areas

Indexing Well-defined criteria for when and how to apply indexes Is this a lost art? Big use of BITMAPS Composite index trick (acts like a table)

Partitioning for performance, rather than data management

Exploiting Oracle’s Direct Path INSERT feature

Avoiding UPDATE and DELETE commands Copy with MINUS instead

Implementing Oracle's Parallel Query

Turn off referential integrity in the DW... no brainer That’s the job of the source system
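"Copy with MINUS instead" is terse, so the following is one plausible reading rather than the team's documented method: instead of deleting rows in place, build a new table from everything except the unwanted rows. The sketch uses Python's sqlite3 module, where the set-difference operator is spelled EXCEPT (Oracle spells it MINUS). Table and function names are hypothetical.

```python
import sqlite3


def copy_without(conn: sqlite3.Connection, table: str, unwanted: str, target: str) -> None:
    """Create `target` holding the rows of `table` that do not appear in `unwanted`."""
    # EXCEPT is a set operation, so exact duplicate rows collapse to one.
    conn.execute(
        f"CREATE TABLE {target} AS "
        f"SELECT * FROM {table} EXCEPT SELECT * FROM {unwanted}"
    )
    conn.commit()
```

The new table can then be renamed into place, which avoids row-by-row DELETE work on a large table.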

Page 72: Healthcare Best Practices in Data Warehousing & Analytics

DW Monitoring: Empowering End Users

Motive Too many calls from end users about their queries

“Please kill it.” “Is it still running or is my PC locked up?” “Why is the DW so slow?”

Give them the insight and tools Give them the ability to kill their own queries

Still in the works

Page 74: Healthcare Best Practices in Data Warehousing & Analytics

Tracking Long-Running Queries

We use Pinecone (from Ambeo) to monitor the duration of all queries and the SQL

Each week, we look at the top few Typical outcome?

We’ll add indexes We’ll denormalize We'll contact the user and assist them with writing a better query

Page 75: Healthcare Best Practices in Data Warehousing & Analytics

The DW Sandbox More empowerment for customers Motive

Lots of little MS Access databases (with valuable data) spread all over the place

Needed to be joined with DW data Costly to maintain PC hogs

Solution Provide customers with their own “sandbox” on the DW, with DBA-like privileges

Page 76: Healthcare Best Practices in Data Warehousing & Analytics

Features Web based tool for creating tables and loading MS Access data to the DW Simple, easy to use interface

Privileges Users have full rights to the tables they create Can grant rights to others

Big, big victory for customer service and data “maturity” 10% of DW customers use the Sandbox About 600 tables in use now About 2 GB of data

Page 77: Healthcare Best Practices in Data Warehousing & Analytics

Design-Build Best Practices Build vertically, design horizontally

Start by building data marts that address analytic needs in one area of the business with a fairly limited data set

But, design with the horizontal needs of the company in mind, so that you will eventually “tie” all of these vertical data marts together with a common semantic layer

Page 78: Healthcare Best Practices in Data Warehousing & Analytics

Creating Value In Both Axes

Build

Design

Page 79: Healthcare Best Practices in Data Warehousing & Analytics

For Example…

[Diagram: source data sets shown as vertical bars] Cancer Registry, Mammography, Radiology, Pathology, Laboratory, Continuing Care And Follow-Up, Quality of Life Survey, Radiation Therapy, Health Plans Claims, Ambulatory Casemix, Acute Care Casemix

An Integrated Reporting Model of Cancer Patient’s Data

Oncology Data Integration Strategy Top down reporting requirements and data model

Disparate Sources “connected” semantically to the data bus

Page 81: Healthcare Best Practices in Data Warehousing & Analytics

Evidence of Business Process Alignment

1. Map out your high level business process Don’t fall prey to analysis paralysis with endless business process modeling diagrams!

2. Identify and associate the transaction systems that support those processes

3. Identify the common, overlapping semantics/data attributes and their utilization rates

4. Build your data marts within an enterprise framework that is aligned with the processes you are trying to understand

Page 82: Healthcare Best Practices in Data Warehousing & Analytics

For example…

[Diagram, labels as extracted; some system names remain run together]

Healthcare business process: Diagnosis, Health Need, Patient Perception, Procedure, Results & Outcomes, Episode of Care, AP/AR, Claims Processing

Supported by non-integrated data in Transaction Systems… HELP, Lab, HPIMC400, Survey, AS400IDX, HDMCIS/CDRHNA, Rx

Integrated in the Data Warehouse

Page 83: Healthcare Best Practices in Data Warehousing & Analytics

Event Correlation A leading edge Best Practice

The third dimension to rows and columns Overlays the data that underlies a report or graph

“In 2004, we experienced a drop in revenue as a result of the earthquake that destroyed our plant in the Philippines.”

“In January of 2005, we saw a spike in the North America market for snow shovel sales that coincided with an increase in sales for pain relievers. This correlates to the record snowfall in that region and should not be considered a trend. Barring major product innovation, we consider the market for snow shovels in this area as saturated. Sales will be slow for the next several years.”

Page 84: Healthcare Best Practices in Data Warehousing & Analytics

Standardizing Semantics The sweet irony: there are many synonyms for “standard semantics” Data dictionary Vocabulary Dimensions Data elements Data attributes

The bottom line issue: Standardizing the terms you use to describe key facts about your business

Page 85: Healthcare Best Practices in Data Warehousing & Analytics

Standardizing “Names of Things”

You better do it within the first two months of your data warehouse project If you are beyond that point, you better stop and do it now, lest you pay a bigger price later

Don’t… Push the standard on the source systems, unless it’s easy to accomplish This was one of the common pitfalls of early data warehousing project failures Try to standardize everything under the sun! Focus on the high value facts

Page 86: Healthcare Best Practices in Data Warehousing & Analytics

Where Are The “High Value” Semantics? In the high-overlap, high-utilization areas…

[Diagram: Source System X, Source System Y, Source System Z; label: Highest value area for standardizing semantics]

Page 87: Healthcare Best Practices in Data Warehousing & Analytics

Another Perspective

[Chart axes: Semantic Utilization, Semantic Overlap]

Page 88: Healthcare Best Practices in Data Warehousing & Analytics

The Standard Semantic “Layer”

Data Warehouse

Source Systems Extract, Transform, Load

Semantic Standards

Page 89: Healthcare Best Practices in Data Warehousing & Analytics

Data Modeling

Star schemas are great and simple, but they aren’t the end-all, be-all of analytic data modeling

Best practices: Do what makes sense; don’t be a schema bigot I’ve seen great analytic value from 3NF models

Maintain data familiarity for your customers When meeting vertical needs Don’t make massive changes to the way the model looks and feels, nor the naming conventions; you will alienate existing users of the data

Use views to achieve “new” or standards-compliant perspectives on data When meeting horizontal needs
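The view approach can be sketched as follows, again with Python's sqlite3 module standing in for the warehouse. The table keeps its source-style column names for the vertical customers, and a view exposes the same rows under standardized names for horizontal use. The helper name and the example column names are made up for illustration.

```python
import sqlite3
from typing import Dict


def create_standard_view(
    conn: sqlite3.Connection, view: str, table: str, column_map: Dict[str, str]
) -> None:
    """Create a view that renames source-style columns to the standard vocabulary."""
    select_list = ", ".join(f"{src} AS {std}" for src, std in column_map.items())
    conn.execute(f"CREATE VIEW {view} AS SELECT {select_list} FROM {table}")
```

For example, a mapping such as `{"pt_no": "patient_id", "rslt_val": "result_value"}` leaves the underlying table untouched while giving enterprise-wide reports consistent names.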

Page 90: Healthcare Best Practices in Data Warehousing & Analytics

90

For Example…

[Figure labels: Source perspective; DW perspective; Similar names & organization; Vertical data customer; Horizontal data customer; “Standardized” view]

Page 91: Healthcare Best Practices in Data Warehousing & Analytics

91

The Case For Timely Updates
[Figure: chart with axes “% Requests for Data utilization” (0 to 100) and “Data Age” (Today, 1 year, 2 years)]

Generally, to minimize Total Cost of Ownership (TCO), your update frequency should be no greater than the decision making cycle associated with the data. But… everyone wants more timely data.

Page 92: Healthcare Best Practices in Data Warehousing & Analytics

92

Best Practice: Measure Yourself

Employee satisfaction
Customer satisfaction
Average number of queries/month
Number of queries above a threshold (30 minutes?)
Average query response time
Total number of records
Total number of query-able tables
Total number of query-able columns
Number of “users”
Average rows delivered per month
Storage utilization
CPU utilization
Downtime per month by data mart

The Data Warehouse Dashboard

Page 93: Healthcare Best Practices in Data Warehousing & Analytics

93

Other Best Practices

The Data Warehouse Information Systems Team reports to the CIO
  Most data analysts can and probably should report to the business units
Change management/service level agreements with the source systems
  No changes in the source systems unless they are coordinated with the data warehouse team

Page 94: Healthcare Best Practices in Data Warehousing & Analytics

94

More Best Practices
Skills of the Data Warehouse IS Team
  Experienced chief architect/project manager
  Procedural/script programmers
  SQL/declarative programmers
  Data warehouse storage management architects
  Data warehouse hardware architects and system administrators
  Data architects/modelers
  DBAs

Page 95: Healthcare Best Practices in Data Warehousing & Analytics

95

More Best Practices
Evidence of project collaboration
  A cross section of members and expertise from the data warehouse IS team
  Statisticians and data analysts who understand the business domain
  A customer that understands the process(es) being measured and can influence change
  A data steward: usually someone from the front lines who knows how the data is collected

Project = complex reports or a data mart

Page 96: Healthcare Best Practices in Data Warehousing & Analytics

96

More Best Practices
When at all possible, always extract as close to the source as possible
[Figure labels: Primary Source; Copy A; Copy B; Data Warehouse; Best Practice Path]

Page 97: Healthcare Best Practices in Data Warehousing & Analytics

97

The Most Popular Authors
I appreciate…
  The interest they stir
  The vocabulary (the semantics) of this new specialty that they helped create
The downside…
  The buzzwords that are more buzz than substance
    “Corporate Information Factories”
  Endless, meaningless debate
    “That’s not an Operational Data Store!”
    “Do you follow Kimball or Inmon?”
Follow your own common sense
  Most of these authors have not had to build a data warehouse from scratch and live with their decisions through a complete lifecycle

Page 98: Healthcare Best Practices in Data Warehousing & Analytics

98

ETL Operations
Besides the cultural risks and challenges, the riskiest part of a data warehouse…
  The Extract, Transform, and Load processes
Good book
  Westerman, WalMart Data Warehousing
Worthy of its own “Best Practices” discussion
  Suffice it to say, mitigate risks in this area carefully and deliberately
  The major design errors don’t show up until late in the lifecycle, when the cost of repair is great

Page 99: Healthcare Best Practices in Data Warehousing & Analytics

99

Two Essential ETL Functions
Initial loads
  How far back do we go in history?
Maintenance loads
  Differential loads or total refresh?
  How often?
You will run and tune these processes several times before you go into production
  How many records are we dealing with?
  How long will this take to run?
  What’s the impact on the source system performance?

Page 100: Healthcare Best Practices in Data Warehousing & Analytics

100

Maintenance Loads
Total refresh vs. incremental loads
  Total refresh: Truncate and reload everything from the source system
  Incremental: Load only the new and updated records
For small data sets, a total refresh strategy is the easiest to implement
  How do you define “small”? You will know it when you don’t see it.
  Sometimes the fastest strategy when you are trying to show quick results
    Grab and go…

Page 101: Healthcare Best Practices in Data Warehousing & Analytics

101

Incremental Loads
How do we get a snapshot of the data that has changed since the last load?
  Many source systems will have an existing log file of some kind
  Take advantage of these when you can, otherwise incremental loads can be complicated
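One way to picture using a source-system log for incremental loads, sketched with Python's built-in sqlite3 module. The `orders` and `change_log` tables, their columns, and the timestamp format are all assumptions made for illustration; real source systems log changes in many different ways.

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT);
    -- Hypothetical log kept by the source system: one row per change.
    CREATE TABLE change_log (order_id INTEGER, changed_at TEXT);
    INSERT INTO orders VALUES (1, 'closed'), (2, 'open'), (3, 'open');
    INSERT INTO change_log VALUES
        (1, '2005-06-01 08:00:00'),
        (2, '2005-06-02 09:30:00'),
        (3, '2005-06-03 10:15:00'),
        (2, '2005-06-03 11:00:00');
    """
)


def changed_since(conn, last_load_time):
    """Return the current version of every order changed after last_load_time."""
    return conn.execute(
        """
        SELECT o.order_id, o.status
        FROM orders AS o
        WHERE o.order_id IN (
            SELECT order_id FROM change_log WHERE changed_at > ?
        )
        ORDER BY o.order_id
        """,
        (last_load_time,),
    ).fetchall()


delta = changed_since(source, "2005-06-02 00:00:00")
print(delta)
```

The warehouse only needs to remember the time of its last successful load; each record appears once in the extract even if the log mentions it several times.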

Page 102: Healthcare Best Practices in Data Warehousing & Analytics

102

File Transfer Formats
Design your extract so that it uses…
  Fixed, predetermined length for all records and fields
    Avoid variable length if at all possible
  A unique character that separates each field in a record, such as ~
  A standard format for header records across all source systems
    Such as the first three records in each file
    Include name of source system, file, record count, and number of fields in the record
    This will be handy for monitoring jobs and collecting load metadata
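A rough sketch of what reading such a file could look like in Python. The exact header layout below (one record each for source system and file name, then a record holding the two counts) is an assumption for illustration; the slides only say which items the header should carry. Only the `~` delimiter and the header checks are shown, not fixed-length fields.

```python
def parse_extract(text, delimiter="~"):
    """Parse an extract whose first three records form a header.

    Assumed header layout (hypothetical, not from the slides):
      record 1: source system name
      record 2: file name
      record 3: record count and field count, separated by the delimiter
    """
    lines = text.splitlines()
    source_system, file_name = lines[0], lines[1]
    record_count, field_count = (int(n) for n in lines[2].split(delimiter))
    records = [line.split(delimiter) for line in lines[3:]]

    # Reject the file if it does not match what its own header promises.
    if len(records) != record_count:
        raise ValueError(f"expected {record_count} records, got {len(records)}")
    for number, record in enumerate(records, start=1):
        if len(record) != field_count:
            raise ValueError(f"record {number} has {len(record)} fields")

    header = {
        "source_system": source_system,
        "file_name": file_name,
        "record_count": record_count,
        "field_count": field_count,
    }
    return header, records


sample = "LAB_SYSTEM\nresults_20050601.txt\n2~3\n1001~GLU~7.2\n1002~HGB~13.5\n"
header, records = parse_extract(sample)
print(header, records)
```

Because the header states the expected counts, a truncated or malformed transfer is caught before any data reaches the warehouse.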

Page 103: Healthcare Best Practices in Data Warehousing & Analytics

103

Benefits of Standard File Transfer Format

Compatible with standard database and operating system utilities
Dynamically create initial and maintenance load scripts
  Read the table definitions (DDL), then merge that with the standard transfer file format
Dynamically generate load monitoring data
  Read the header row, insert that into a “Load Status” table with status “Running”, # of records, start time
  At EOF, change status to “Complete” and capture end of load time
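The load-monitoring idea can be sketched as follows with Python's built-in sqlite3 module. The `load_status` column names and the helper functions `start_load` and `finish_load` are hypothetical; the slides name only the table's purpose and the two status values.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE load_status (
        source_system TEXT, file_name TEXT, record_count INTEGER,
        status TEXT, start_time TEXT, end_time TEXT
    )
    """
)


def start_load(conn, source_system, file_name, record_count):
    """Record a new load as Running, using values read from the header."""
    cursor = conn.execute(
        "INSERT INTO load_status VALUES (?, ?, ?, 'Running', ?, NULL)",
        (source_system, file_name, record_count, datetime.now().isoformat()),
    )
    return cursor.lastrowid


def finish_load(conn, load_id):
    """At end of file, mark the load Complete and capture the end time."""
    conn.execute(
        "UPDATE load_status SET status = 'Complete', end_time = ? WHERE rowid = ?",
        (datetime.now().isoformat(), load_id),
    )


load_id = start_load(conn, "LAB_SYSTEM", "results_20050601.txt", 2)
status_while_running = conn.execute("SELECT status FROM load_status").fetchone()[0]
# ... the records themselves would be loaded here ...
finish_load(conn, load_id)
status_after = conn.execute("SELECT status, end_time FROM load_status").fetchone()
print(status_while_running, status_after)
```

A load that stays in “Running” long after its start time is an easy signal for a monitoring job to pick up.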

I wish I had thought about this topic more, and earlier in my career

Page 104: Healthcare Best Practices in Data Warehousing & Analytics

104

Westerman Makes A Good Point

My experience: ETL is the least tasteful and productive use of a veteran EDW Team member, so I like Westerman’s insight on this topic

If you design for instantaneous updates from the beginning, it translates to less ETL maintenance and labor time for the EDW staff, later

Page 105: Healthcare Best Practices in Data Warehousing & Analytics

105

Messaging Applied to ETL

Basic concepts
  Use a load message queue for records that need to be updated, coming from the source systems
  When the EDW analytical processing workload is low (off-peak), pick the next message off the load queue and load the data
  Run this in parallel so that you can process several load messages at the same time while you have a window of opportunity
  Sometimes called “throttling”
    Speed up and slow down based upon traffic conditions
Motive behind the concept
  Continuous updates in a mixed workload environment
    Mixed: Analytical processing at the same time as transaction oriented, constant updates, deletes, inserts
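A simplified sketch of the throttling concept in Python. It is sequential rather than parallel, to keep the example short and deterministic; the function name `drain_load_queue`, the 0.7 busy threshold, and the batch size of 4 are illustrative assumptions, and the workload readings are simulated rather than taken from a real database.

```python
from collections import deque


def drain_load_queue(queue, workload_readings, apply_message,
                     busy_threshold=0.7, max_batch=4):
    """Apply queued load messages, throttled by the analytic workload.

    workload_readings yields one utilization figure (0.0 to 1.0) per cycle.
    The threshold and batch size are illustrative assumptions.
    """
    for workload in workload_readings:
        if not queue:
            break
        if workload >= busy_threshold:
            continue  # analysts are busy: hold all loads this cycle
        # The quieter the warehouse, the more messages this cycle may apply.
        headroom = (busy_threshold - workload) / busy_threshold
        batch_size = max(1, int(max_batch * headroom))
        for _ in range(min(batch_size, len(queue))):
            apply_message(queue.popleft())


loaded = []
pending = deque(f"msg-{n}" for n in range(1, 7))
# Simulated readings: busy, quiet, busy, moderately busy, idle.
drain_load_queue(pending, [0.9, 0.1, 0.8, 0.6, 0.0], loaded.append)
print(loaded, list(pending))
```

In a real ETL manager the batch would be handed to parallel workers and the readings would come from the database's own workload and performance metrics.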

Page 106: Healthcare Best Practices in Data Warehousing & Analytics

106

ETL Message Queue Process

[Figure labels: Source Systems (Updates, Inserts, Deletes); ETL Message Queue; ETL Manager; Database Workload and Performance Metrics; EDW Production Tables]

Page 107: Healthcare Best Practices in Data Warehousing & Analytics

107

Four Data Maintenance Processes

Initial load
  Loading into an empty table
Append load
Update process
Delete process
As much as practical, use your database utilities for these processes
  Study and know your database utilities for data warehousing; they are getting better all the time
  I see some bad strategies in this area: companies spending time building their own utilities…aye cucumber!

Page 108: Healthcare Best Practices in Data Warehousing & Analytics

108

A Few Planning Thoughts
Understand the percentage of records that will be updated, deleted, or inserted
  You’ll probably develop a different process for 90% inserts vs. 90% updates
Logging
  In general, turn logging off during the processes, if logging was on at all
Field vs. record level updates
  Some folks, in the interest of purity, will build complex update processes for passing only field (attribute) level changes
  No brainer: Pass the whole record

Page 109: Healthcare Best Practices in Data Warehousing & Analytics

109

Initial Load

Every table will, at some time, require an initial load
For some tables, it will be the best choice for data maintenance
  Total data refresh
  Best for “small” tables
Simple process to implement
  Simply delete (or truncate) and reload with fresh data

Page 110: Healthcare Best Practices in Data Warehousing & Analytics

110

A Better Initial Load Process
Background load
  Safer: protects against corrupt files
  Higher availability to customers
Three or four steps… maybe 6?
  1. Create a temporary table
  2. Load the temporary table
  3. Run quality checks
  4. Rename the temporary table to the production table name
  5. Delete the old table
  6. Regrant rights, if necessary

Westerman: “You want to use as many initial load processes as possible.”

I agree!
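The background-load steps can be sketched like this with Python's built-in sqlite3 module. The function `background_reload`, the `_temp` and `_old` suffixes, the two-column table layout, and the row-count quality check are all assumptions for illustration. Here the old table is renamed out of the way before the swap, and step 6 is omitted because SQLite has no grants.

```python
import sqlite3


def background_reload(conn, table, rows, min_rows=1):
    """Reload `table` via a temporary copy, then swap it into place.

    `table` must be a trusted identifier; it is interpolated into SQL.
    Assumes a simple two-column layout (id, name) purely for illustration.
    """
    temp, old = f"{table}_temp", f"{table}_old"
    # 1. Create the temporary table.
    conn.execute(f"DROP TABLE IF EXISTS {temp}")
    conn.execute(f"CREATE TABLE {temp} (id INTEGER PRIMARY KEY, name TEXT)")
    # 2. Load the temporary table.
    conn.executemany(f"INSERT INTO {temp} VALUES (?, ?)", rows)
    # 3. Run quality checks; production is untouched if they fail.
    (count,) = conn.execute(f"SELECT COUNT(*) FROM {temp}").fetchone()
    if count < min_rows:
        conn.execute(f"DROP TABLE {temp}")
        conn.commit()
        raise ValueError("quality check failed: too few rows loaded")
    # 4. Rename the temporary table to the production name.
    conn.execute(f"ALTER TABLE {table} RENAME TO {old}")
    conn.execute(f"ALTER TABLE {temp} RENAME TO {table}")
    # 5. Delete the old table.
    conn.execute(f"DROP TABLE {old}")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clinic (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO clinic VALUES (1, 'Old Clinic')")
background_reload(conn, "clinic", [(1, "North Clinic"), (2, "South Clinic")])
reloaded = conn.execute("SELECT id, name FROM clinic ORDER BY id").fetchall()
print(reloaded)
```

Customers keep querying the old table while the slow part (the load and the checks) runs; only the rename is visible to them.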

Page 111: Healthcare Best Practices in Data Warehousing & Analytics

111

Append Load
For larger tables that accumulate historical data
There are no updates, just appends
  A hard fact that will not change
Example
  Sales that are closed
  Lab results

Page 112: Healthcare Best Practices in Data Warehousing & Analytics

112

Append Load Options
Load a single part of a table
Load a partition and ‘attach’ it to the table
  Create a new, empty partition
  Load the new records
  Attach the partition to the table

Look to use the “LOAD APPEND” command in your database

Page 113: Healthcare Best Practices in Data Warehousing & Analytics

113

Another Append Option
1. Create a temp table identical to the one you are loading
2. Load the new records into the empty temp table
3. Issue INSERT/SELECT
   INSERT INTO Big_Table (SELECT * FROM Temp_Big_Table)
4. Delete the temp table

IF # RECORDS IN TEMP IS MUCH < # OF RECORDS IN BIG
THEN GOOD TECHNIQUE
ELSE NOT GOOD
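The four steps above, run end to end in Python's built-in sqlite3 module as a small illustration. The column layout and sample rows are invented; only the table names and the INSERT/SELECT pattern come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE big_table (result_id INTEGER PRIMARY KEY, value REAL);
    INSERT INTO big_table VALUES (1, 7.2), (2, 6.8), (3, 7.0);
    -- 1. Temp table identical to the one being loaded.
    CREATE TABLE temp_big_table (result_id INTEGER PRIMARY KEY, value REAL);
    """
)
# 2. Load the new records into the empty temp table.
conn.executemany("INSERT INTO temp_big_table VALUES (?, ?)", [(4, 7.4), (5, 6.9)])
# 3. Append them to the big table in one set-based statement.
conn.execute("INSERT INTO big_table SELECT * FROM temp_big_table")
# 4. Delete the temp table.
conn.execute("DROP TABLE temp_big_table")
conn.commit()

(total_rows,) = conn.execute("SELECT COUNT(*) FROM big_table").fetchone()
print(total_rows)
```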

Page 114: Healthcare Best Practices in Data Warehousing & Analytics

114

Update Process
The most difficult and risky to build
Use this process only if the tables are too large for a complete refresh, “Initial Load” process
Updates affect data that changes over time
  Like Purchase Orders, hospital transactions, etc.
  Medical records, if you treat the data maintenance at the macroscopic level

Page 115: Healthcare Best Practices in Data Warehousing & Analytics

115

Update Process Options
Simple process
  1. Separate the affected records into an update file, insert file, or delete file
     Do this on the source system, if possible
  2. Transfer the files to the data warehouse staging area
  3. Create and run two processes
     – A delete process for deleting the records in the production table that need to be updated or deleted
     – An insert process for inserting the entirely new “updated” record into the production table, as well as the true inserts
Simple, but typically not very fast

Page 116: Healthcare Best Practices in Data Warehousing & Analytics

116

Simple Process

[Figure labels: Source System (Updated records, Deleted records, New records); EDW Staging Area (Updates, Deletes, Inserts); Delete Process; Insert Process; EDW Production Table; flows numbered 1 to 6 as described below]

1. Delete Process identifies records for deletion from the Production Table based upon contents of the Updates file.
2. Delete Process identifies records for deletion from the Production Table based upon contents of the Deletes file.
3. Delete Process deletes records from Production Table.
4. Insert Process identifies records for insert to the Production Table based upon contents of the Updates file.
5. Insert Process identifies records for insert to the Production Table based upon contents of the Inserts file.
6. Insert Process inserts records into the Production Table.
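A compact sketch of the delete-then-insert process using Python's built-in sqlite3 module. The `purchase_order` table, the `apply_changes` function, and the use of Python lists in place of staged files are assumptions for illustration.

```python
import sqlite3


def apply_changes(conn, updates, deletes, inserts):
    """Apply staged change files to the production table as delete + insert.

    Each argument is a list standing in for a staged file: updates and
    inserts hold whole (id, status) records, deletes holds ids.
    """
    # Steps 1-3: delete every record named in the updates or deletes file.
    doomed = [(row[0],) for row in updates] + [(key,) for key in deletes]
    conn.executemany("DELETE FROM purchase_order WHERE id = ?", doomed)
    # Steps 4-6: insert the new versions of updated records and the true inserts.
    conn.executemany("INSERT INTO purchase_order VALUES (?, ?)", updates + inserts)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchase_order (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany(
    "INSERT INTO purchase_order VALUES (?, ?)",
    [(1, "open"), (2, "open"), (3, "open")],
)
apply_changes(conn, updates=[(2, "shipped")], deletes=[3], inserts=[(4, "open")])
final_rows = conn.execute("SELECT * FROM purchase_order ORDER BY id").fetchall()
print(final_rows)
```

Passing whole records keeps the logic to two simple statements, at the cost of rewriting rows where only one field changed.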

Page 117: Healthcare Best Practices in Data Warehousing & Analytics

117

When You Are Unsure
Sometimes, source system log and audit files make it difficult to know if a record was updated or inserted (i.e., created)
Try this…
1. Load the records into a temp table that is identical to the production table to be updated
2. Delete corresponding records from the production table
   DELETE FROM prod_table WHERE key_field IN (SELECT temp_key_field FROM temp_table)
3. Insert all the records from the temp table into the production table
Most databases now support this with an UPSERT
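The three steps above, run against a tiny example in Python's built-in sqlite3 module. The `payload` column and the sample rows are invented; the temp table here reuses the name `key_field` because it is declared identical to the production table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE prod_table (key_field INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO prod_table VALUES (1, 'unchanged'), (2, 'stale');
    -- 1. Temp table identical to production, holding the incoming records.
    CREATE TABLE temp_table (key_field INTEGER PRIMARY KEY, payload TEXT);
    INSERT INTO temp_table VALUES (2, 'fresh'), (3, 'brand new');
    """
)
# 2. Delete production records that have a counterpart in the temp table.
conn.execute(
    "DELETE FROM prod_table WHERE key_field IN (SELECT key_field FROM temp_table)"
)
# 3. Insert everything from the temp table, whether it was an update or an insert.
conn.execute("INSERT INTO prod_table SELECT * FROM temp_table")
conn.commit()

merged = conn.execute("SELECT * FROM prod_table ORDER BY key_field").fetchall()
print(merged)
```

In SQLite specifically, `INSERT OR REPLACE` performs the same delete-then-insert in a single statement; other databases spell their upsert differently.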

Page 118: Healthcare Best Practices in Data Warehousing & Analytics

118

Massive Deletes
Just as with Updates and Inserts, the number of Deletes you have to manage is inversely proportional to the frequency of your ETL processes
  Infrequent ETL means massive data operations
Partitions work well for this, again
  E.g., keeping a 5 year window of data
    Insert most recent year with a partition
    Delete the last year’s partition
  Blazing fast!
[Figure: partitions 1 to 5, labeled “Delete partition” and “Insert partition”]
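SQLite has no table partitioning, so the following Python sketch only emulates the sliding-window idea with one table per year standing in for a partition. The `visits_<year>` naming and the `slide_window` function are hypothetical; a database with real partitioning would use its own attach and drop partition commands instead.

```python
import sqlite3


def slide_window(conn, years, new_year, rows):
    """Add new_year as its own table and drop the oldest year's table."""
    conn.execute(f"CREATE TABLE visits_{new_year} (visit_id INTEGER, visit_date TEXT)")
    conn.executemany(f"INSERT INTO visits_{new_year} VALUES (?, ?)", rows)
    oldest = years[0]
    # Dropping a whole table avoids a row-by-row DELETE of a year of data.
    conn.execute(f"DROP TABLE visits_{oldest}")
    conn.commit()
    return years[1:] + [new_year]


conn = sqlite3.connect(":memory:")
window = [2000, 2001, 2002, 2003, 2004]
for year in window:
    conn.execute(f"CREATE TABLE visits_{year} (visit_id INTEGER, visit_date TEXT)")

window = slide_window(conn, window, 2005, [(1, "2005-01-03")])
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(window, tables)
```

The window stays five years wide, and removing a year never touches individual rows.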

Page 119: Healthcare Best Practices in Data Warehousing & Analytics

119

“Raw” Data Standards for ETL
Makes the process of communicating with your source system partners much easier
  Data type (e.g., format for date time stamps)
  File formats (ASCII vs. EBCDIC)
  Header records
  Control characters
Rule of thumb
  Never transfer data at the binary level unless you are transferring between binary compatible computer systems
  Use only text-displayable characters
  Less rework time vs. less storage space and faster transfer speed
    Storage and CPU time are cheap compared to labor

Page 120: Healthcare Best Practices in Data Warehousing & Analytics

120

Last Thought… Indexing Strategies
Define these early, practice them religiously, use them extensively
This is “Database Design 101”
  Don’t fall prey to this most common performance problem!

Page 121: Healthcare Best Practices in Data Warehousing & Analytics

121

My Thanks
For being invited…
For your time and attention
For the many folks who have worked for and with me over the years that made me look better as a result
Please contact me if you have any questions
[email protected] PH: 312-695-8618