moving towards a data repository that facilitates data analysis chop november 18, 2009 1
TRANSCRIPT
Normalization
• Normalization - process of efficiently organizing data in a database to reduce redundancies of data
• Goal - consistency of data – Store data once and one time only!– security – disk space– speed of queries– efficiency of database updates– data integrity
In normalized database no aggregation and no calculated fields
3
Unnormalized data set
Patient ID
Name Address DOB Doc ApptDate
Location DX
111111 Cindy Marselis
2320 Edge Hill Road
1/11/64 Armstrong 9/1/09 11:00 AM
Alter 2011 Herniated DiscFlu
111111 Cindy Marselis
9331 Rising Sun Avenue
1/11/64 Morningstar 9/1/09 11:00 AM
Alter 2011 Herniated Disc
111111 Cindy Marselis
2320 Edge Hill Road
1/11/64 Allen 11/1/09 10:00 AM
Alter 2012 Psoriasis
222222 Kathryn Marselis
2320 Edge Hill Road
11/3/04 Dershaw 8/1/09 11:00 AM
Speakman 105
Well baby check
111111 Cindy Schwartz
9331 Rising Sun Avenue
1/11/64 Armstrong 8/11/09 3:00 PM
Alter 105 PsoriasisHerniated Disc
5
Free text
• Issues with string searches– Must match exactly in
case, punctuation, spelling, etc.
• Use of lookup tables where possible
10
Unstructured Text
• Gartner: white-collar workers spend from 30 to 40% of time managing documents
• Merrill Lynch: > 85 % of business information exists as unstructured data
– e-mails, memos, notes from call centers and support operations, news, user groups, chats, reports, letters, surveys, white papers, marketing material, research, presentations and Web pages.
• In relational db, data that can't be stored in rows and columns.
– stored in a BLOB (binary large object)
– e-mail files, word-processing text documents, PowerPoint presentations, JPEG and GIF image files, and MPEG video file
• Metadata (data about data can be stored)http://www.information-management.com/issues/20030201/6287-1.html 11
Approaches to structured and unstructured data
1. Unique database: consolidates all structured and unstructured data together
– expensive to buy and maintain
– large volume of data can clog the database making it slow and inefficient
2. Use two databases: one structured data, and one for unstructured data.
– Avoids performance issues with structured data
– significant performance limitations for unstructured data
12
Approaches to structured and unstructured data
3. Unstructured data left on file servers with database to record and links to unstructured data files.
– Avoids issue with volumes of data
– Fragile as links are broken when files and folders moved around.
– Must create links every time new document created
4. Complex and expensive connectors used to tap in all databases and file servers providing unified view of data.
– Expensive and complex requiring purchase and maintenance of multiple databases and file servers with the added cost of all required connectors.
5. Patents currently under development.
13
Certification Commission for Health Information Technology (CCHIT)
EHR Construct
EMAR: Electronic Medication Administration RecordCPOE: Computerized Physician Order Entry PFS: Physician Fee ScheduleOC/RR: Physician Order Communication/Results RetrievalCPOE: Computerized Physician Order Entry PFS: Physician Fee ScheduleR-ADT: Registration Admission Discharge Transfer
14
16
• External and internal forces require tactical and strategic decisions
• Search for competitive advantage
• Business environments are dynamic
• Decision-making cycle time is reduced
Pressures Driving Need for Business Intelligence and Data Warehousing
17
• Operational data – Relational, normalized
database – Optimized to support
transactions – Real time updates
Operational vs. Decision Support Data
• DSS – Snapshot of operational
data– Summarized – Large amounts of data
• Data analyst viewpoint– Timespan– Granularity– Dimensionality
19
• Separated from operational environment• Integrated Data• Historical data over long time horizon• Snapshot data captured at given time• Subject-oriented data• Mainly read-only data with periodic batch updates from
operational source, no online updates
Codd’s Key Data Warehouse Rules
20
• Contains different levels of data detail – Current and old detail– Lightly and highly summarized
• Metadata (data about the data) critical components – Identify and define data elements– Provide the source, transformation,
integration, storage, usage, relationships, and history of data elements
Codd’s Key Data Warehouse Rules contd.
Decomposition of DSS –Operational Data
o Tumor registryo A/D/To Radiology narrativeo Pathology narrativeo Lab resultso Patient Accountingo Charge Master
23
Decomposition of DSS –External Data
o Research spidero Treatment guidelineso Reimbursement scheduleso NCI/NIH protocols
24
Decomposition of DSS –ETL
• Rationalize normal lab values• Transform gender codes and free
text• Narrative dumps• Doctor cleansing
o Similar nameso Which practice gets credit?
25
26
ETL – Extraction, Transformation, Load
Transform: cleanse data for consistency and output exceptionso Apply business ruleso Selecting certain columns to load
(not null records)o Translating coded values (1, M, male
= 0 )o Derive new calculated value
(sale_amount = qty * unit_price)o Join data from multiple sources
(lookup, merge)o Aggregate (rollup/summarize data –
average LOS for each doctor by DRG)
o Transpose/pivot (turning columns into rows)
o Data validation.
Extractdata from
source systems
Load:data into
repository
ETL Best Practice
27
• ETL: 60-80% of development effort• Create multi-departmental team charged with
consensus on Transformation!• Review exceptions carefully
o Indicator of issues with operational db designo Indicator of changes needed in transformation
Decomposition of DSS –Business Data
• Business data – central repository
• Includes metadata: source, format, timing of feeds
Characteristic Factors
Integrated • Centralized• Holds data retrieved from
entire organization
Subject-Oriented
• Optimized to give answers to diverse questions
• Used by all functional areas
Time Variant • Flow of data through time• Projected data
Non-Volatile • Data never removed• Always growing
28
Decomposition of DSS –Business Model Data
• Comprehensive Cancer Center definition of a patient
o Must have seen a physician for suspected or confirmed benign or malignant condition
o What about patients seen for screening mammography?
29
Decomposition of DSS –End user query tool
• Web-based or client-server?• OLAP – Online Analytic Processing
o Microsofto Business Objects (bought by
SAP)o MicroStrategyo Cognos (bought by IBM)o Oracle (includes Hyperion)
30
Decomposition of DSS –End –user tool
o Drill down functionalityo Roll upo Charts – not data levelo Export features
http://demos.telerik.com/aspnet-ajax/chart/examples/functionality/drilldown/defaultcs.aspx
31
Star Schema
• Center fact tableo usually contains numeric information for summary reports.
• Dimension table radiate from fact table• Dimension table is hierarchial
• ‘rollup’ allows to compare types of hospitals, disease categories, or even patient age bands.
• Creates logical data cube dimensions identifying a set of numeric measurements within the cube.
33
34
• Data-modeling technique • Maps multidimensional decision support into relational
database• Yield model for multidimensional data analysis while
preserving relational structure of operational DB• Four Components:
– Facts– Dimensions– Attributes– Attribute hierarchies
Star Schema
39
• Advanced data analysis environment• Supports decision making, business modeling, and
operations research activities • Characteristics of OLAP
– Use multidimensional data analysis techniques– Provide advanced database support– Provide easy-to-use end-user interfaces– Support client/server architecture
Online Analytical Processing (OLAP)