
Relational Databases and Statistical Processing

Andrew Westlake
Survey & Statistical Computing

+44 20 8374 4723

15-Mar-03 Statistical Databases and RDBMS © S&SC 2

Introduction

• Good Statistical Analysis needs Good Data
  – Many things contribute to data quality
    • Quality of structures and processes for data collection related to quality of data
• Good Ideas from Computer Science
  – Process design methodologies
  – Database methodologies
  – Data and process modelling systems
• Statisticians can benefit from CS concepts
  – Particularly for design of complex structures and continuous processes


Data Structure

• Statistics generally based on a single rectangular dataset
  – Much statistical data is conceived and stored in this form
• Database professionals often use Relational Database systems
  – Formal model, based on a set of related rectangular datasets
  – Allows the representation of more complex structures, and easier maintenance and manipulation of the data
• RDB approach valuable for some statistical situations
  – Review RDB concepts and functionality, plus SQL
  – Statistical Examples
    • Pakistan Fertility Survey
    • Processing and analysis system for HIV and AIDS reporting for England and Wales


The Relational Model

• A logical specification of the content and behaviour of a database management system, including
    • The types of structure that can be present in a database
    • The properties of elements that can be stored in these structures
    • The operations that can be performed on these structures and their behaviour
    • Facilities that must be present in the database management system
    • The general nature of the interactions between the database and its users and administrators
  – Codd’s Rules specify properties that a Relational DataBase Management System (RDBMS) must possess
• SQL
  – A standard language with which to interact with an RDBMS


Relational Database System Structure

Layer                     Role
User Processes            Applications & Tools; include functionality and semantics appropriate to the application
Data Modelling            Expressing the real task in terms of the conceptual model
SQL                       Standard interface to an RDBMS; syntax and embedding
Relational Model          Specification of functionality, behaviour and scope
Storage & Access Methods  Implementation issue; affects performance

[Diagram: User Processes (Schema Design, Data Entry, Transaction Processing, MIS Reports) communicate through SQL with the DataBase Management System (Insertion, Deletion, Updates; Consistency; Sorting, Grouping; Aggregation; Retrieval, Selection; Transformation; Data Dictionary), which holds the Database (Data and Rules).]


RDBMS Strengths

• Relational Model
  – Precise, formal mathematical specification of structure and behaviour
  – Appropriate for information about Entities, and Relationships between them
• Data Modelling
  – Useful tools for understanding data structures and flows
• SQL
  – International Standard (SQL2), widely implemented
• Current Implementations
  – Widely available, well supported, good implementations, integration with other products, add-on market for tools


Components of the Relational Model

• Domain
    • a set of possible (atomic) values
• Attribute
    • defined over a domain, has a name, cf. variable
• Tuple
    • a set of values, one associated with each attribute, cf. cases or records. NULL values supported
• Relation
    • defined over a set of attributes, has a name, consists of a set of tuples
• Relational DataBase
    • a set of relations


Relations

• Rectangular Data tables, in which columns are variables, rows are cases.
• Tables can be linked by the values of common variables.

Relation Cancer_Registration

Domain     Registration ID  Gender  Age in years         Weight in Kg, 1dp  SEG  Age in years      ICD
Attribute  ID               Sex     Age_at_Registration  Weight             SEG  Age_at_Diagnosis  Diagnosis
Tuples     4951             2       58                   54.5               3    53                194
           4952             1       68                   75.8               2    60                285
           4953             2       73                   62.3               5    70                162
           4954             2       52                   48.7               2    45                501
           4955             1       49                   95.2               4    47                162
           4956             1       68                   112.7              3    61                196
           4957             1       87                   84.2               4    85                203
           4958             2       74                   69.4               4    70                162
           4959             1       69                   53.0               1    69                186
           4960             2       92                   83.0               5    92                162
           4961             1       45                   67.4               5    41                503

Relation Lung_Diseases

Domain     ICD        String
Attribute  Diagnosis  Disease_Name
Tuples     162        Cancer of the Lung
           501        Asbestosis
           503        Mesothelioma
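The two relations above can be tried out directly. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in for any RDBMS; the column types are assumptions chosen to mirror the domains on the slide, and the link uses the common variable Diagnosis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cancer_Registration (
    ID                  INTEGER PRIMARY KEY,  -- domain: Registration ID
    Sex                 INTEGER,              -- domain: Gender
    Age_at_Registration INTEGER,              -- domain: Age in years
    Weight              REAL,                 -- domain: Weight in Kg, 1dp
    SEG                 INTEGER,              -- domain: SEG
    Age_at_Diagnosis    INTEGER,              -- domain: Age in years
    Diagnosis           INTEGER               -- domain: ICD
);
CREATE TABLE Lung_Diseases (
    Diagnosis    INTEGER PRIMARY KEY,         -- domain: ICD
    Disease_Name TEXT                         -- domain: String
);
""")

# Tuples exactly as shown on the slide.
registrations = [
    (4951, 2, 58, 54.5, 3, 53, 194), (4952, 1, 68, 75.8, 2, 60, 285),
    (4953, 2, 73, 62.3, 5, 70, 162), (4954, 2, 52, 48.7, 2, 45, 501),
    (4955, 1, 49, 95.2, 4, 47, 162), (4956, 1, 68, 112.7, 3, 61, 196),
    (4957, 1, 87, 84.2, 4, 85, 203), (4958, 2, 74, 69.4, 4, 70, 162),
    (4959, 1, 69, 53.0, 1, 69, 186), (4960, 2, 92, 83.0, 5, 92, 162),
    (4961, 1, 45, 67.4, 5, 41, 503),
]
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?, ?, ?, ?, ?)", registrations)
conn.executemany(
    "INSERT INTO Lung_Diseases VALUES (?, ?)",
    [(162, "Cancer of the Lung"), (501, "Asbestosis"), (503, "Mesothelioma")])

# Link the tables through the values of the common variable Diagnosis.
linked = conn.execute("""
    SELECT r.ID, r.Diagnosis, d.Disease_Name
    FROM Cancer_Registration AS r
    JOIN Lung_Diseases AS d ON r.Diagnosis = d.Diagnosis
    ORDER BY r.ID
""").fetchall()

for row in linked:
    print(row)
```

Only registrations whose Diagnosis code appears in Lung_Diseases are returned, so the result is a lung-disease subset of the registrations with the disease name attached.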


SQL

– International Standard, actively revised
– Widely available in good RDBMS software
– Text language, used by programs and people
    • Stored or constructed at run-time
– Designed to support tools which are independent of the DBMS
    • cf. Client-Server architecture
– User and Programmer skills portable across products and sites
– Has sections for
    • Defining database structure (DDL)
    • Manipulating database content (DML)
    • Ensuring database integrity
    • Managing database security


SQL Data Manipulation

SELECT {DISTINCT} <expression list>
FROM <table list>
WHERE <condition>
GROUP BY <column list>
HAVING <group condition>
ORDER BY <column list>

• All manipulation (retrieval) of data uses the single Select statement, which has various components corresponding to different relational operations.

• The result of a SELECT statement is a relational table, which is displayed (by default) or can be stored or processed in another statement.

• Retrieval is usually done through a Query interface, which generates the SQL.
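The points above can be illustrated with a short sketch, assuming Python's built-in sqlite3 module as the RDBMS and a cut-down Cancer_Registration table (only ID, Sex and Diagnosis, with the values from the Relations slide). The table name Sex2_Codes is hypothetical. The result of a SELECT is itself a table, so it can be displayed, stored, or used inside another statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cancer_Registration "
    "(ID INTEGER PRIMARY KEY, Sex INTEGER, Diagnosis INTEGER)")
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?)",
    [(4951, 2, 194), (4952, 1, 285), (4953, 2, 162), (4954, 2, 501),
     (4955, 1, 162), (4956, 1, 196), (4957, 1, 203), (4958, 2, 162),
     (4959, 1, 186), (4960, 2, 162), (4961, 1, 503)])

# Selection (WHERE), duplicate removal (DISTINCT) and sorting (ORDER BY)
# expressed in a single SELECT statement.
query = ("SELECT DISTINCT Diagnosis FROM Cancer_Registration "
         "WHERE Sex = 2 ORDER BY Diagnosis")

# 1. Display the result table.
codes_sex2 = [row[0] for row in conn.execute(query)]
print(codes_sex2)

# 2. Store the result table (Sex2_Codes is a hypothetical name).
conn.execute("CREATE TABLE Sex2_Codes AS " + query)
stored_count = conn.execute("SELECT COUNT(*) FROM Sex2_Codes").fetchone()[0]

# 3. Process the result table inside another statement.
nested_count = conn.execute(
    "SELECT COUNT(*) FROM (" + query + ")").fetchone()[0]
print(stored_count, nested_count)
```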


Data Manipulation - Join

• The Join operation selects the combination of rows from two or more tables that meet a matching condition
    SELECT R1.*, R2.* FROM R1, R2 WHERE A = X
• All matching combinations are retained
  – E.g. can relate household information to all members via household number

Join A=X

A  B   C   R1.D  X  R2.D  E
1  b1  c1  d1    1  d3    e1
2  b2  c2  d2    2  d5    e2
3  b3  c3  d3    3  d3    e3
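A runnable sketch of the join above, assuming Python's built-in sqlite3 module. R1 and R2 hold the values from the slide; the Household and Member tables and their contents are hypothetical, added only to show that every matching combination is kept when one household has several members.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R1 (A INTEGER, B TEXT, C TEXT, D TEXT);
CREATE TABLE R2 (X INTEGER, D TEXT, E TEXT);
INSERT INTO R1 VALUES (1, 'b1', 'c1', 'd1'), (2, 'b2', 'c2', 'd2'), (3, 'b3', 'c3', 'd3');
INSERT INTO R2 VALUES (1, 'd3', 'e1'), (2, 'd5', 'e2'), (3, 'd3', 'e3');

-- Hypothetical one-to-many data: two households, three members.
CREATE TABLE Household (H_NO INTEGER PRIMARY KEY, Region TEXT);
CREATE TABLE Member (H_NO INTEGER, Line INTEGER, Age INTEGER);
INSERT INTO Household VALUES (1, 'North'), (2, 'South');
INSERT INTO Member VALUES (1, 1, 34), (1, 2, 31), (2, 1, 58);
""")

# The join from the slide: rows of R1 and R2 are combined where A = X.
joined = conn.execute(
    "SELECT R1.*, R2.* FROM R1, R2 WHERE A = X ORDER BY A").fetchall()

# Household information is repeated for every member of the household.
members = conn.execute("""
    SELECT m.H_NO, m.Line, m.Age, h.Region
    FROM Member AS m, Household AS h
    WHERE m.H_NO = h.H_NO
    ORDER BY m.H_NO, m.Line
""").fetchall()

for row in joined + members:
    print(row)
```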


Data Manipulation - Aggregation

• Aggregation functions can be used in the variable list
    SELECT COUNT(DIAGNOSIS), COUNT(DISTINCT DIAGNOSIS), AVG(WEIGHT)
    FROM CANCER_REGISTRATION
  – This produces one record for the whole table containing three values: the number of records with non-null values for Diagnosis, the number of distinct Diagnosis values, and the average Weight.
  – The available functions are COUNT, SUM, AVG, MIN, MAX.
  – Aggregation functions can be applied to expressions.
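The aggregation query can be checked against the Cancer_Registration values from the Relations slide. This is a sketch assuming Python's built-in sqlite3 module and a cut-down table with only the columns needed; the row with ID 9999 is hypothetical and is added only to show how NULL values are counted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Cancer_Registration (
        ID INTEGER PRIMARY KEY, Age_at_Registration INTEGER,
        Weight REAL, Age_at_Diagnosis INTEGER, Diagnosis INTEGER)
""")
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?, ?, ?)",
    [(4951, 58, 54.5, 53, 194), (4952, 68, 75.8, 60, 285),
     (4953, 73, 62.3, 70, 162), (4954, 52, 48.7, 45, 501),
     (4955, 49, 95.2, 47, 162), (4956, 68, 112.7, 61, 196),
     (4957, 87, 84.2, 85, 203), (4958, 74, 69.4, 70, 162),
     (4959, 69, 53.0, 69, 186), (4960, 92, 83.0, 92, 162),
     (4961, 45, 67.4, 41, 503)])

# One record for the whole table, containing three values.
count_diag, distinct_diag, mean_weight = conn.execute("""
    SELECT COUNT(Diagnosis), COUNT(DISTINCT Diagnosis), AVG(Weight)
    FROM Cancer_Registration
""").fetchone()

# An aggregation function applied to an expression:
# the longest gap between diagnosis and registration, in years.
max_gap = conn.execute("""
    SELECT MAX(Age_at_Registration - Age_at_Diagnosis)
    FROM Cancer_Registration
""").fetchone()[0]

# Hypothetical extra row with a NULL Diagnosis: COUNT(*) counts it,
# COUNT(Diagnosis) does not.
conn.execute("INSERT INTO Cancer_Registration VALUES (9999, NULL, NULL, NULL, NULL)")
total_rows, non_null = conn.execute(
    "SELECT COUNT(*), COUNT(Diagnosis) FROM Cancer_Registration").fetchone()

print(count_diag, distinct_diag, round(mean_weight, 2), max_gap, total_rows, non_null)
```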


Data Manipulation - Grouping

• The GROUP BY clause is used to perform aggregation within subgroups, producing a record for each group.
    SELECT DIAGNOSIS, COUNT(DIAGNOSIS), AVG(WEIGHT)
    FROM CANCER_REGISTRATION
    GROUP BY DIAGNOSIS
    ORDER BY DIAGNOSIS
  – This produces a record for each diagnosis containing the number of registrations and their average weight as well as the diagnosis code.
  – It is sorted into diagnosis code order, using the ORDER BY clause.
  – The Group By clause can contain multiple fields (for cross-tabulation).
  – The expression list can contain variables used in the Group By clause, and aggregate functions for any variable.
  – A WHERE clause selects the records to be aggregated.
  – A HAVING clause selects the groups to be returned.
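A sketch of the grouping query, assuming Python's built-in sqlite3 module and the Cancer_Registration values from the Relations slide (only ID, Sex, Weight and Diagnosis). The second and third queries are illustrative additions showing WHERE and HAVING together, and a two-field GROUP BY as the basis for a cross-tabulation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cancer_Registration "
    "(ID INTEGER PRIMARY KEY, Sex INTEGER, Weight REAL, Diagnosis INTEGER)")
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?, ?)",
    [(4951, 2, 54.5, 194), (4952, 1, 75.8, 285), (4953, 2, 62.3, 162),
     (4954, 2, 48.7, 501), (4955, 1, 95.2, 162), (4956, 1, 112.7, 196),
     (4957, 1, 84.2, 203), (4958, 2, 69.4, 162), (4959, 1, 53.0, 186),
     (4960, 2, 83.0, 162), (4961, 1, 67.4, 503)])

# One record per diagnosis: code, number of registrations, average weight.
by_diagnosis = conn.execute("""
    SELECT Diagnosis, COUNT(Diagnosis), AVG(Weight)
    FROM Cancer_Registration
    GROUP BY Diagnosis
    ORDER BY Diagnosis
""").fetchall()

# WHERE picks the records to aggregate; HAVING picks the groups to return.
repeated_sex2 = conn.execute("""
    SELECT Diagnosis, COUNT(*)
    FROM Cancer_Registration
    WHERE Sex = 2
    GROUP BY Diagnosis
    HAVING COUNT(*) > 1
    ORDER BY Diagnosis
""").fetchall()

# Two grouping fields give the cells of a Sex by Diagnosis cross-tabulation.
cells = conn.execute("""
    SELECT Sex, Diagnosis, COUNT(*)
    FROM Cancer_Registration
    GROUP BY Sex, Diagnosis
    ORDER BY Sex, Diagnosis
""").fetchall()

print(by_diagnosis[0], repeated_sex2, len(cells))
```

Only non-empty cells are returned by the cross-tabulation query; empty combinations would have to be filled in by the reporting tool.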


Views

• Stored query definition
  – Can be evaluated on demand
  – Used to present data in the form needed by the user (a person or a process)
• Dynamic evaluation
  – Ensures that the viewed information is up to date
  – May be inefficient if the information does not change
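Dynamic evaluation can be seen in a small sketch, assuming Python's built-in sqlite3 module. The view name Lung_Cancer_Cases and the registration with ID 4962 are hypothetical; the other rows are a subset of the Relations slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cancer_Registration (ID INTEGER PRIMARY KEY, Weight REAL, Diagnosis INTEGER);
INSERT INTO Cancer_Registration VALUES
    (4953, 62.3, 162), (4954, 48.7, 501), (4955, 95.2, 162), (4958, 69.4, 162);

-- The view stores the query definition, not the data.
CREATE VIEW Lung_Cancer_Cases AS
    SELECT ID, Weight FROM Cancer_Registration WHERE Diagnosis = 162;
""")

count_view = "SELECT COUNT(*) FROM Lung_Cancer_Cases"
before = conn.execute(count_view).fetchone()[0]

# A new (hypothetical) registration arrives in the base table.
conn.execute("INSERT INTO Cancer_Registration VALUES (4962, 70.0, 162)")

# The view is re-evaluated on demand, so it already includes the new case.
after = conn.execute(count_view).fetchone()[0]
print(before, after)
```

Because the query is re-run on every use, a view over data that rarely changes repeats the same work each time; storing a static result table is the usual alternative in that case.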


Database Design

• Choose physical tables and fields
  – Capture essential characteristics of real-world system
  – Support intended uses of the information (directly or through views)
  – Operate efficiently
• Identify important links between tables
  – Provide appropriate Keys
  – Choose Indexes and other implementation details
• Retain flexibility for future needs
• Various methodologies for design process
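Keys and indexes can be declared as part of the table design. The sketch below assumes Python's built-in sqlite3 module; the Household and Member tables, their columns and the index name Member_Age are hypothetical. It shows the DBMS itself rejecting rows that would break a link or duplicate a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this setting is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Household (
    H_NO     INTEGER PRIMARY KEY,
    District TEXT
);
CREATE TABLE Member (
    H_NO INTEGER NOT NULL REFERENCES Household(H_NO),  -- link to Household
    Line INTEGER NOT NULL,
    Age  INTEGER,
    PRIMARY KEY (H_NO, Line)                           -- compound key
);
-- An index is an implementation detail: it affects speed, not results.
CREATE INDEX Member_Age ON Member(Age);

INSERT INTO Household VALUES (1, 'A');
INSERT INTO Member VALUES (1, 1, 40);
""")

def rejected(sql):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A member of a household that does not exist breaks the foreign key.
orphan_rejected = rejected("INSERT INTO Member VALUES (99, 1, 25)")
# A second row with the same (H_NO, Line) breaks the primary key.
duplicate_rejected = rejected("INSERT INTO Member VALUES (1, 1, 41)")

index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'Member'")]
print(orphan_rejected, duplicate_rejected, index_names)
```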


RDB Summary

• Relational databases are ubiquitous, and are useful for large-scale data collections
• Some manipulation and aggregation operations can be done more easily than in statistical packages
• Relational model is a useful way of thinking about data structures
  – May suggest different data structure
• No replacement for statistical packages for statistical analysis
• Flexible manipulation of more complex data structures
• Useful development tools for processes
• Interface to Views for Statistical Analysis
  – ODBC or similar
  – Extract micro- or macro-data for statistical analysis
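The division of labour in the last point can be sketched as follows. This assumes Python's built-in sqlite3 module in place of an ODBC connection and the standard statistics module in place of a statistical package; the view name Analysis_View is hypothetical, and the data are from the Relations slide.

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Cancer_Registration (ID INTEGER PRIMARY KEY, Sex INTEGER, Weight REAL)")
conn.executemany(
    "INSERT INTO Cancer_Registration VALUES (?, ?, ?)",
    [(4951, 2, 54.5), (4952, 1, 75.8), (4953, 2, 62.3), (4954, 2, 48.7),
     (4955, 1, 95.2), (4956, 1, 112.7), (4957, 1, 84.2), (4958, 2, 69.4),
     (4959, 1, 53.0), (4960, 2, 83.0), (4961, 1, 67.4)])

# The view presents the data in the rectangular form the analysis needs.
conn.execute(
    "CREATE VIEW Analysis_View AS SELECT ID, Sex, Weight FROM Cancer_Registration")

# Micro-data: one row per case, analysed outside the database.
micro = conn.execute("SELECT Sex, Weight FROM Analysis_View").fetchall()
weights = [weight for _, weight in micro]
median_weight = statistics.median(weights)
sd_weight = statistics.stdev(weights)

# Macro-data: the database aggregates, the analysis receives a small table.
macro = conn.execute("""
    SELECT Sex, COUNT(*), AVG(Weight)
    FROM Analysis_View
    GROUP BY Sex
    ORDER BY Sex
""").fetchall()

print(median_weight, round(sd_weight, 2), macro)
```

The median and standard deviation are computed outside the database because they are not among the basic SQL aggregate functions listed earlier.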


Databases for Statistics

• Simple situations
  – Use statistical or survey packages
• Studies with Complex Structure
  – Use RDB for flexible manipulation
  – May store static result tables
• Ongoing Data Capture Processes
  – Use RDB for dynamic data entry, updating, matching and checking
  – Choose data structure to optimise data processes and quality, not ease of statistical processing
  – May take static snapshots for regular statistical analysis
• Statistical Dissemination – another presentation
• Statistical Metadata – yet another presentation


PFFPS – Main structure

[Entity-relationship diagram; tables, keys and fields as shown]

PFFPSHID
  Keys: CLUSTER (PK, FK1, I1); H_NO (PK)
  Fields: RECORD_TYP, PROVINCE, AREA, DISTRICT, VISIT, F_DAY, F_MON, F_YEA, INTERVIEW, SUPERVISOR, DEO, RESULT, Q16T, Q16M, Q16F, Q17, Q18, Q19, Q20, Q21, Q22, Q23_1, Q23_2, Q23_3, Q23_4, Q23_5, Q23_6, Q23_7, Q23_8, Q24, Q25_1, Q25_2, Q25_3, Q25_4, Q26, Q27

PFFPSHHM
  Keys: CLUST_HHM (PK, FK1, I1); H_NO_HHM (PK, FK1, I1); Q01 (PK)
  Fields: RECORD_TYP, PROV_HHM, AREA_HHM, Q03, Q04, Q05, Q06, Q07, Q08, Q09, Q10, Q11, Q12, Q13, Q14, Q15

PFFPSWID
  Keys: CLUST_HHM (PK, FK1); H_NO_HHM (PK, FK1); Q01 (PK, FK1)
  Fields: RECORD_TYP, PROV_WID (I10), AREA_WID (I1), LINE_WID (I9), DIST_WID (I3), F_DAY_WID (I4), F_MON_WID (I5), F_YEA_WID (I6), VISIT_WID (I11), WOM_RES, Q101H, Q101M, Q102, Q103, Q104, Q105M, Q105Y, Q106, Q107, Q108L, Q108C, Q109, Q110, Q111, Q112, Q113, Q114, Q115, Q201, Q202, Q203S, Q203D, Q204, Q205S, Q205D, Q206, Q207S, Q207D, Q208, Q210, Q218, Q219, Q220, Q221, Q222, Q223, Q224, Q225_1, Q225_2, Q225_3, Q225_4, Q225_5, Q225_6, Q225_7, Q226, Q227, Interview CMC, Birth CMC

Weights
  Keys: Cluster (PK)
  Fields: Weight


Data Tables for PFFPS

Table     Key columns
Weights   Cluster (PK)
PFFPSHID  CLUSTER (PK, FK1, I1); H_NO (PK)
PFFPSHHM  CLUST_HHM (PK, FK1, I1); H_NO_HHM (PK, FK1, I1); Q01 (PK)
PFFPSWID  CLUST_WID (PK, FK1); H_NO_WID (PK, FK1); L_NO_WID (PK, FK1)
PFFPSWBH  CLUST_WBH (PK, FK1); H_NO_WBH (PK, FK1); L_NO_WBH (PK, FK1); LINE_WBH (PK)
PFFPSWPB  CLUST_WPB (PK, FK1); H_NO_WPB (PK, FK1); L_NO_WPB (PK, FK1); LINE_WPB (PK, FK1)
PFFPSWC1  CLUST_WC1 (PK, FK1); H_NO_WC1 (PK, FK1); L_NO_WC1 (PK, FK1); Q301 (PK)
PFFPSWC2  CLUST_WC2 (PK, FK1); H_NO_WC2 (PK, FK1); L_NO_WC2 (PK, FK1)
PFFPSWMF  CLUST_WMF (PK, FK1); H_NO_WMF (PK, FK1); L_NO_WMF (PK, FK1)
PFFPSWSE  CLUST_WSE (PK, FK1); H_NO_WSE (PK, FK1); L_NO_WSE (PK, FK1)
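The key structure above is hierarchical: each level repeats the keys of its parent and adds one more. The sketch below assumes Python's built-in sqlite3 module and uses only the key columns of Weights, PFFPSHID and PFFPSHHM from the diagram. All data values are hypothetical, and reading Q01 as the member's line number within the household is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Weights (Cluster INTEGER PRIMARY KEY, Weight REAL);
CREATE TABLE PFFPSHID (
    CLUSTER INTEGER REFERENCES Weights(Cluster),
    H_NO    INTEGER,
    PRIMARY KEY (CLUSTER, H_NO)
);
CREATE TABLE PFFPSHHM (
    CLUST_HHM INTEGER,
    H_NO_HHM  INTEGER,
    Q01       INTEGER,   -- assumed: line number of the household member
    PRIMARY KEY (CLUST_HHM, H_NO_HHM, Q01),
    FOREIGN KEY (CLUST_HHM, H_NO_HHM) REFERENCES PFFPSHID (CLUSTER, H_NO)
);
-- Hypothetical data: two clusters, three households, six members.
INSERT INTO Weights VALUES (1, 1.5), (2, 0.5);
INSERT INTO PFFPSHID VALUES (1, 1), (1, 2), (2, 1);
INSERT INTO PFFPSHHM VALUES (1, 1, 1), (1, 1, 2), (1, 2, 1),
                            (2, 1, 1), (2, 1, 2), (2, 1, 3);
""")

# Members per household, with the cluster weight attached through the keys.
household_sizes = conn.execute("""
    SELECT h.CLUSTER, h.H_NO, COUNT(*) AS members, w.Weight
    FROM PFFPSHID AS h
    JOIN PFFPSHHM AS m ON m.CLUST_HHM = h.CLUSTER AND m.H_NO_HHM = h.H_NO
    JOIN Weights  AS w ON w.Cluster = h.CLUSTER
    GROUP BY h.CLUSTER, h.H_NO, w.Weight
    ORDER BY h.CLUSTER, h.H_NO
""").fetchall()

# Weighted number of household members across the sample.
weighted_members = sum(n * weight for _, _, n, weight in household_sizes)
print(household_sizes, weighted_members)
```

The same pattern extends downwards: a woman-level table joins to its household member on three key columns, and a birth-history table joins to the woman on those three plus its own line number.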


PFFPS Meta Data Structure

Table       Key columns
Record      Record (PK)
Dictionary  Field (PK); Record (FK1, I1)
Category    Field (PK, FK1); Code (PK, I1)
Range       Field (PK, FK1); Lower (PK)
Skip        Field (PK, FK1); Code (PK, I1)


HIV and AIDS reporting section, PHLS

• Redesign of processing and reporting system
• Old system (1994) based on files processed with Epi-Info and Aspect database
• Requirements
  – Simplify management of data
  – Improve consistency
  – Simplify regular reporting
  – Enable ad hoc reporting


Old System

• Input Notifications
  – HIV Tests (standard form)
  – AIDS Diagnosis (standard)
  – ONS Death Reports
• Some notifications on disc, most on paper
• Data Storage
  – Separate HIV and AIDS files
  – Deaths Added
• Reporting
  – Quarterly standard reports
  – Ad Hoc inquiries (e.g. PQs)
  – Occasional in-depth analysis
• Processes
  – Continuous Data Entry from paper
  – Regular Linking for reporting extractions
  – Occasional Matching within and between files
• Problems
  – Duplicates and omissions
  – Name matching
  – Repeated linking
  – Inflexible for reporting (e.g. progression from HIV to AIDS)


Old Structure

[Diagram components: HIV Forms, HIV Notification File, AIDS Forms, AIDS Notification File, Death Reports on disc, Quarterly Extracts]


Initial Thoughts

• Notification forms appropriate for reporting, but not for primary information store
• Information is about Patients
  – Notifications are sources of information
  – Record information about events and states
• Do matching once, at initial entry
• Unified database structure
• Additional components for metadata, auditing, contact information
• Extracts to separate statistical reporting system


Full system

[Diagram: Overall Structure, with components Patient Data, Notification Processing, External Organisation, Metadata System, Auditing System and Reporting Facilities]


Main Patient Tables

[Entity-relationship diagram; tables and columns as shown]

Patient: Patient ID (PK); Soundex (I11); Init (I5); DOB (I2); Sex (I10); Occ (I7); PC Out (I9); PC In (I8); DHA (I1); LA (I6); Ethnic Gp (I3); Exp (I4); Audit ID (FK1, U2, U1)
HIV Infection: Patient ID (PK, FK2); HivNo (U3); Hospital ID (FK3); Audit ID (FK1, U2, U1)
AIDS Diagnosis: Patient ID (PK, FK2); AidsNo (U1); Hospital ID (FK3); Audit ID (FK1, U3, U2)
Death Record: Patient ID (PK, FK2); AidsNo (FK3, U1); HivNo (FK4, U5); ONS ID (U6); Death ID (U4); Date Death (I1); Audit ID (FK1, U3, U2)
AIDS Indicator: Patient ID (PK, FK1, I1); Indicator Type Code (PK); Indicator Code (PK); Audit ID (FK2, U2, U1)


Patients and Notifications

[Diagram. Legend: Source Notifications, Optional Components, Repeating Components.
Boxes: Patient; HIV Infection; AIDS Diagnosis; Death Record; HIV Notification (green form); AIDS Notification (blue form); ONS Death Notification; Alternative IDs (Soundex, DoB); Previous Status (Date); Blood Donation; Blood Recipient; Follow Up Study (Vicky); Action (Type, Date, Outcome); External ID (Context, Institution); Residence (PostCode, Date); AIDS Indicator; External Files]


Full HAP Data Structure

[Entity-relationship diagram; tables and key columns as shown. Relationship labels on the diagram: Person, works at]

Death Record: Patient ID (PK, FK2); AidsNo (FK3, U1); HivNo (FK4, U5); ONS ID (U6); Death ID (U4); Audit ID (FK1, U3, U2)
Patient: Patient ID (PK); Audit ID (FK1, U2, U1)
AIDS Diagnosis: Patient ID (PK, FK2); AidsNo (U1); Hospital ID (FK3); Audit ID (FK1, U3, U2)
HIV Infection: Patient ID (PK, FK2); HivNo (U3); Hospital ID (FK3); Audit ID (FK1, U2, U1)
External Patient ID: Patient ID (PK, FK3, I3); Organisation ID (PK, FK2, I1); ID in Institution (PK, I2); Audit ID (FK1, U1)
Alternative ID: Patient ID (PK, FK2, I1); Identifier Type Code (PK, I3); Identifier Value (PK, I2); Audit ID (FK1, U2, U1)
Action: Action ID (PK); Patient ID (FK2, I4, I2); Audit ID (FK1, U1, U2)
Contact: Organisation ID (PK, FK3, I2); Person ID (PK, FK2, I1); Audit ID (FK1, U2, U1)
Person: Person ID (PK); Audit ID (FK1, U2, U1)
Organisation: Organisation ID (PK); Director (FK2, I2); Primary Contact (FK3, I3); Audit ID (FK1, U2, U1)
AIDS Indicator: Patient ID (PK, FK1, I1); Indicator Type Code (PK); Indicator Code (PK); Audit ID (FK2, U2, U1)
Notification: Notification ID (PK); Patient ID (FK3); Batch ID (FK2, I2, I1); Notification No (U3); Organisation ID (FK4, I6); Audit ID (FK1, U1, U2)
Audit Block: Audit ID (PK); Session ID (FK1, I1); Creation Session ID (FK2, I2)
Session: Session ID (PK); User Code (FK1, I1)
User: User Code (PK); Session ID (FK1, I1)
Note: Note ID (PK); Patient ID (FK2, I2, I1); Audit ID (FK1, U2, U1)


Data Entry

[Diagram of data entry forms]
  New Notification: Enter common Patient identification from a New Notification; View Matches with existing Patients; Initiate entry of a new Patient or updating an existing one
  External Data Sources
  New Patient: Create a new Patient record
  Transfer: Transfer information from New Notification
  HIV Notification: Information about the Patient and HIV Infection taken from a green HIV form
  AIDS Notification: Information about the Patient and AIDS Diagnosis taken from a blue AIDS form
  Death Record: Add or Update Death Information for a Patient
  Tables updated: Patient, HIV Infection, AIDS Diagnosis, Death Record

• Get Patient details from source
• Look for matching existing patients
  – May need expert to decide
  – Create new Patient if needed
• Open appropriate form for source to complete transfer of Patient and details
• Forms know how to handle duplicates (generally take earliest)
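The entry steps above can be sketched in code. This is an illustration only, assuming Python's built-in sqlite3 module: the matching rule (equal Soundex code and date of birth), the FirstPositive column, the function names and all data values are hypothetical, and the real system's matching logic is not described in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Patient (
    PatientID INTEGER PRIMARY KEY,
    Soundex   TEXT,
    DOB       TEXT               -- ISO date text, so text order is date order
);
CREATE TABLE HIV_Infection (
    PatientID     INTEGER PRIMARY KEY REFERENCES Patient(PatientID),
    FirstPositive TEXT           -- hypothetical column
);
CREATE INDEX Patient_Match ON Patient(Soundex, DOB);
INSERT INTO Patient VALUES (1, 'S530', '1960-01-15'), (2, 'J520', '1972-07-03');
""")

def find_candidates(soundex, dob):
    """Look for existing patients matching the identification on the form."""
    rows = conn.execute(
        "SELECT PatientID FROM Patient WHERE Soundex = ? AND DOB = ? "
        "ORDER BY PatientID", (soundex, dob))
    return [row[0] for row in rows]

def resolve_patient(soundex, dob):
    """Return (patient_id, outcome); ambiguous matches go to an expert."""
    candidates = find_candidates(soundex, dob)
    if len(candidates) == 1:
        return candidates[0], "matched"
    if candidates:
        return None, "needs expert review"
    cur = conn.execute(
        "INSERT INTO Patient (Soundex, DOB) VALUES (?, ?)", (soundex, dob))
    return cur.lastrowid, "created"

def record_hiv_date(patient_id, date):
    """Handle duplicate notifications by keeping the earliest date."""
    row = conn.execute(
        "SELECT FirstPositive FROM HIV_Infection WHERE PatientID = ?",
        (patient_id,)).fetchone()
    if row is None:
        conn.execute("INSERT INTO HIV_Infection VALUES (?, ?)", (patient_id, date))
    elif date < row[0]:
        conn.execute(
            "UPDATE HIV_Infection SET FirstPositive = ? WHERE PatientID = ?",
            (date, patient_id))

existing = resolve_patient("S530", "1960-01-15")
new = resolve_patient("B620", "1980-02-02")
for notified in ("1995-06-01", "1994-11-20", "1996-02-10"):
    record_hiv_date(1, notified)
first_positive = conn.execute(
    "SELECT FirstPositive FROM HIV_Infection WHERE PatientID = 1").fetchone()[0]
print(existing, new, first_positive)
```

Matching is done once, at entry, so later reporting can rely on Patient ID alone rather than repeating the name matching.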


Extraction

• Separate Extraction Database template
  – Linked to main database
  – Set of associated Excel spreadsheet templates
• New copy each quarter
  – Populate by Extraction of information through Views
  – Gives a Snapshot of Patient status
• Set of aggregation queries
  – Import results into spreadsheets as source
• Reorganise with Pivot tables
  – Create publication versions
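The snapshot idea can be sketched as follows, assuming Python's built-in sqlite3 module with two in-memory databases standing in for the main and extraction databases. The view Patient_Status, its status rule and all data values are hypothetical; the spreadsheet and pivot-table steps are not shown.

```python
import sqlite3

main = sqlite3.connect(":memory:")
main.executescript("""
CREATE TABLE Patient        (PatientID INTEGER PRIMARY KEY);
CREATE TABLE HIV_Infection  (PatientID INTEGER PRIMARY KEY);
CREATE TABLE AIDS_Diagnosis (PatientID INTEGER PRIMARY KEY);
CREATE TABLE Death_Record   (PatientID INTEGER PRIMARY KEY);
INSERT INTO Patient VALUES (1), (2), (3), (4);
INSERT INTO HIV_Infection VALUES (1), (2), (3);
INSERT INTO AIDS_Diagnosis VALUES (2), (3);
INSERT INTO Death_Record VALUES (3);

-- Hypothetical view giving the current status of each patient.
CREATE VIEW Patient_Status AS
SELECT p.PatientID,
       CASE WHEN d.PatientID IS NOT NULL THEN 'Died'
            WHEN a.PatientID IS NOT NULL THEN 'AIDS'
            WHEN h.PatientID IS NOT NULL THEN 'HIV'
            ELSE 'Unknown' END AS Status
FROM Patient AS p
LEFT JOIN HIV_Infection  AS h ON h.PatientID = p.PatientID
LEFT JOIN AIDS_Diagnosis AS a ON a.PatientID = p.PatientID
LEFT JOIN Death_Record   AS d ON d.PatientID = p.PatientID;
""")

# New extraction database for the quarter, populated through the view.
extract = sqlite3.connect(":memory:")
extract.execute(
    "CREATE TABLE Patient_Status (PatientID INTEGER PRIMARY KEY, Status TEXT)")
rows = main.execute("SELECT PatientID, Status FROM Patient_Status").fetchall()
extract.executemany("INSERT INTO Patient_Status VALUES (?, ?)", rows)
extract.commit()

# Data entry continues in the main database after the snapshot is taken.
main.execute("INSERT INTO AIDS_Diagnosis VALUES (1)")

# The same aggregation query, run on the snapshot and on the live view.
aggregate = ("SELECT Status, COUNT(*) FROM Patient_Status "
             "GROUP BY Status ORDER BY Status")
snapshot_counts = extract.execute(aggregate).fetchall()
live_counts = main.execute(aggregate).fetchall()
print(snapshot_counts, live_counts)
```

The snapshot keeps the counts that were true at extraction time, so published quarterly figures stay reproducible while the main database moves on.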


Conclusions

• Good analysis needs good data
  – Need right structure for efficient processing
• RDB provides flexible structure and manipulation
  – Useful for complex situations
  – Can produce the views needed for statistical analysis
  – Quick to implement at a basic structural level
• Good framework for implementing production processes
  – Important to get the process right for good quality data
  – Not cheap to implement production processes
• Can easily include associated information
  – Metadata, other supporting information


The End

• Extended versions of presentations available from
  – www.sasc.co.uk
  – Follow links to MSc in Statistical Computing
