P20 Seminar November 12, 2009 1
Statistical Collaboration
Part 1: Working with Statisticians from Start to Finish
Part 2: Essentials of Data Management
Objectives
Participants will learn about:
- process of consulting and collaborating with statistician
- general principles of database setup, data entry, verification, cleaning and storage
Part 1: Working with Statisticians from Start to Finish
Kay Savik, MS
Collaboration
“Collaboration implies that statistician
and researcher want to learn and
exchange information. This exchange
should be mutually beneficial.”
Gerald van Belle
Types of Consulting
Cross sectional - statistical advice for data already collected or analyzed
Longitudinal – a long term relationship between statistician and researcher
First Meeting
Intent of study
Source of data
Sampling unit
Randomization
Model of effects
Type of study
Type of data
First Meeting
What is the research question?
What level of statistical knowledge does researcher have?
What are the data and what form are they in?
What are the conventions in this specific area of study?
The Conversation
To prevent type III error – the right answer to the wrong question!
Clarify research aims
Appropriate design
Measurement
Data management
Analysis
Analysis Choice
Sir David Cox –
“Begin with very simple methods and, if possible, end with simple methods”
Rindskopf’s Rules of Statistical Consulting –
“Sometimes the “best” or “right” statistical procedure is not the best for a particular situation.”
Which Statistical Package?
There is not one “perfect” software for any procedure
All standard packages have been tested and are reliable
“Specialized” procedures are found in several packages
Collaborate Rather than Consult
Collaboration is a communal activity
Decide who is responsible for what at first meeting
Politely and quickly leave a collaboration where any party seems misguided or unethical
Decide on questions of authorship at first meeting
Part 2: Essentials of Data Management (DM)
Olga Gurvich, MA
Data Management
Essential part of any research
Interactive and collaborative venture of both investigator and statistician
Requires a system well defined in advance and consistency in its implementation
Data Management Stages
Database setup
Raw data collection [who, what, when, how]
Raw data entry, verification and cleaning
Data storage
[Data re-structuring for statistical analyses]
[Data analysis]
Data archiving
Database Setup - Software
Choice mainly depends on
- Amount of data to be collected
- Complexity of data structure
- Type of data
- Export/import capabilities to/from
- Planned statistical analyses and software
Software: try avoiding Excel
SPSS, ACCESS, EpiInfo, output of survey software, plain text (ASCII)
Database Setup – Structure
Participants => rows; variables => columns
Logical Record: one row contains all data for a single study participant
Multiple Record: multiple rows per single participant
Relational: multiple data files that can be merged
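The three layouts above can be illustrated with a small sketch. All variable names and values here (`sbp_v1`, `age`, and so on) are invented for illustration, and `merge_on_id` is a hypothetical helper, not a function from any package.

```python
# Logical record: one row holds everything about one participant.
logical = [
    {"id": 1, "age": 54, "sbp_v1": 120, "sbp_v2": 118},
    {"id": 2, "age": 61, "sbp_v1": 135, "sbp_v2": 130},
]

# Multiple record: one row per participant per visit.
multiple = [
    {"id": 1, "visit": 1, "sbp": 120},
    {"id": 1, "visit": 2, "sbp": 118},
    {"id": 2, "visit": 1, "sbp": 135},
    {"id": 2, "visit": 2, "sbp": 130},
]

# Relational: separate files that share the ID and can be merged.
participants = [{"id": 1, "age": 54}, {"id": 2, "age": 61}]


def merge_on_id(one_per_id, many_per_id):
    """Attach the participant-level columns to every visit-level row."""
    by_id = {row["id"]: row for row in one_per_id}
    return [{**by_id[row["id"]], **row} for row in many_per_id]


merged = merge_on_id(participants, multiple)
```

The merge only works because every file carries the same unique ID, which is why the ID column matters so much in the slides that follow.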
Database Setup - General
Give short, meaningful and “dated” name
DB given to a statistician for cleaning and analyses should include
- ONLY collected raw data;
- NO graphs, comments, titles, summaries, hidden rows, split-spreadsheets, multiple spreadsheets, imposed “special” formats or highlighting
Database Setup - Variables
Set unique numeric ID(s) in the first column(s)
Identify types of variables, measurement units and type of recording [auto/manual]
Carefully choose variables’ format and length
Dates format MM/DD/YYYY; if parts are missing, create three separate variables
Time format dd hh:mm:ss or similar
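One possible reading of the date advice, as a minimal sketch: month, day and year are kept as three separate variables, and a full date is assembled only when all three are present. The function names are hypothetical, and a blank part is represented here by `None`, which is an assumption.

```python
from datetime import date


def to_date(month, day, year):
    """Return a date when all three parts are known, otherwise None."""
    if None in (month, day, year):
        return None
    return date(year, month, day)


def format_mdy(value):
    """Render a date in the MM/DD/YYYY form recommended above."""
    return value.strftime("%m/%d/%Y")


complete = to_date(3, 15, 2008)
partial = to_date(3, None, 2008)   # day unknown, so no full date is built
```

Keeping the parts separate means a record with a known month and year is still usable, instead of being lost because the full date could not be entered.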
Database Setup - Variables
Create separate variable for every separate piece of information
Give unique, short [6-8 char], meaningful names
No special characters [!, %, $, spaces]
Do not start with a number
Consider other restrictions of specific software [e.g., lower/upper case letters]
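The naming rules above are easy to check automatically before any data are entered. The sketch below is one hedged interpretation: it enforces only the 8-character upper bound, allows underscores (an assumption), and treats names that differ only by case as duplicates. `check_names` is a hypothetical helper.

```python
import re

# Starts with a letter, then letters, digits or underscores; 8 characters at most.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]{0,7}")


def check_names(names):
    """Return {name: problem} for every name that breaks a rule."""
    problems = {}
    seen = set()
    for name in names:
        if not NAME_PATTERN.fullmatch(name):
            problems[name] = "invalid characters, leading digit, or too long"
        elif name.lower() in seen:
            problems[name] = "duplicate (ignoring case)"
        seen.add(name.lower())
    return problems


problems = check_names(
    ["age", "1stvisit", "wt kg", "bmi%", "AGE", "systolic_bp", "sbp_v1"]
)
```

Here `age` and `sbp_v1` pass; the other five names are reported.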
Database Setup - Coding
Assign short and meaningful codes; consistent for same-response variables
Use numeric (if possible) coding; do not combine numeric and character codes within a numeric variable
Address missing values
Avoid using “N/A”, “?”, etc. entirely
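A minimal sketch of consistent numeric coding follows. The code values (0 = no, 1 = yes), the list of missing-value markers, and the choice to store a missing answer as a blank (`None`) are all assumptions made for illustration; the slides do not prescribe a particular missing-value code.

```python
# One code table reused for every yes/no variable in the study.
YES_NO = {"no": 0, "yes": 1}

# Free-text markers that should never reach the database.
MISSING_MARKERS = {"", "n/a", "na", "?", "."}


def encode(raw, codes=YES_NO):
    """Translate a raw response to its numeric code; missing becomes None."""
    text = raw.strip().lower()
    if text in MISSING_MARKERS:
        return None
    return codes[text]   # an unexpected response raises KeyError on purpose


coded = [encode(answer) for answer in ["Yes", " no ", "N/A", "?"]]
```

Letting an unknown response raise an error, instead of silently storing it, keeps character values out of a numeric variable.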
Database Setup – Codebook/Data Dictionary
A written handbook with information on study data:
- Study title, PI name, date of last update, DB name and location
- # of observations, # of variables
- Study variables and their attributes [name, label, location (ASCII), coding (values), format, measurement units]
- Other [formulae, weights, scoring documentation, etc.]
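Part of a codebook can be drafted from the data themselves. The sketch below covers only the counts and per-variable attributes; study title, PI name and the other descriptive fields still have to be written by hand. `build_codebook` and the example variables are hypothetical.

```python
def build_codebook(rows, labels, units):
    """Summarise a list of row dicts as a codebook-style dictionary."""
    names = list(rows[0].keys())
    variables = []
    for name in names:
        observed = sorted({r[name] for r in rows if r[name] is not None})
        variables.append({
            "name": name,
            "label": labels.get(name, ""),
            "units": units.get(name, ""),
            "values": observed,
            "n_missing": sum(r[name] is None for r in rows),
        })
    return {
        "n_observations": len(rows),
        "n_variables": len(names),
        "variables": variables,
    }


rows = [
    {"id": 1, "sex": 0, "wt_kg": 70.5},
    {"id": 2, "sex": 1, "wt_kg": None},
    {"id": 3, "sex": 1, "wt_kg": 82.0},
]
codebook = build_codebook(
    rows,
    labels={"sex": "0 = male, 1 = female", "wt_kg": "Body weight"},
    units={"wt_kg": "kg"},
)
```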
Data Entry, Verification and Cleaning
Ultimate aim is
a fully-documented backed-up archive of
verified, validated and ready-for-use data
Data Entry
“Do it promptly, completely and consistently”
Preferably one trained data entry person [unless double entry]
Unique ID(s)
All the data must be entered in its “raw” form directly from the original records - NO hand calculations
Frequent back-up
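Where double entry is used, the two passes can be compared by ID so that only the disagreements go back to the original records. This is a sketch under the assumption that each pass is held as a dictionary keyed by participant ID; `compare_entries` is a hypothetical helper.

```python
def compare_entries(first, second):
    """List (id, variable, first value, second value) for every disagreement."""
    diffs = []
    for pid in sorted(first.keys() | second.keys()):
        a, b = first.get(pid), second.get(pid)
        if a is None or b is None:
            # The whole record is missing from one of the passes.
            diffs.append((pid, None, a, b))
            continue
        for var in sorted(a.keys() | b.keys()):
            if a.get(var) != b.get(var):
                diffs.append((pid, var, a.get(var), b.get(var)))
    return diffs


first_pass = {1: {"age": 54, "sex": 1}, 2: {"age": 61, "sex": 0}}
second_pass = {
    1: {"age": 45, "sex": 1},
    2: {"age": 61, "sex": 0},
    3: {"age": 30, "sex": 1},
}
discrepancies = compare_entries(first_pass, second_pass)
```

In this example the transposed age for participant 1 and the record entered only once for participant 3 are both reported.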
Data Verification and Cleaning
Optimally done by a statistician or DM professional in close collaboration with investigator
Includes (but is not limited to) general and logic checks to detect errors and outliers, and verification of data completeness (subjects and variables)
Audit trail/log book for a complete record of changes made
Following all necessary corrections, ONE FINAL CLEAN DB is created
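A range check with an audit trail might look like the following sketch. The plausible age range of 18 to 100, the function names, and the structure of the log entries are assumptions for illustration. Corrections are applied to a copy, so the raw data stay untouched.

```python
import copy


def out_of_range(rows, var, low, high):
    """Return IDs whose value for var lies outside [low, high]; blanks are skipped."""
    return [r["id"] for r in rows
            if r[var] is not None and not (low <= r[var] <= high)]


def apply_correction(rows, audit_log, pid, var, new_value, reason):
    """Change one value and record the old value, new value and reason."""
    for r in rows:
        if r["id"] == pid:
            audit_log.append({"id": pid, "variable": var,
                              "old": r[var], "new": new_value, "reason": reason})
            r[var] = new_value
            return
    raise KeyError(f"no participant with id {pid}")


raw = [{"id": 1, "age": 54}, {"id": 2, "age": 610}, {"id": 3, "age": None}]
clean = copy.deepcopy(raw)   # the initial raw DB is never edited
audit_log = []

flagged = out_of_range(clean, "age", 18, 100)
apply_correction(clean, audit_log, 2, "age", 61,
                 "keying error, checked against paper form")
```

After the correction, `clean` holds the final values, `raw` still shows what was originally entered, and `audit_log` explains the difference.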
Data Storage
Stored on a password-protected server are
1. ONE INITIAL RAW DB
2. ONE FINAL CLEAN DB
3. CODEBOOK
4. Audit trail or log book [if used]
Frequent BACK-UPs are performed
All previous DB versions EXCEPT the initial raw one are destroyed
Data Re-Structuring
If not foreseen in advance, may be needed for certain analyses
Usually can be done in statistical packages
Keep a record of any re-structuring
Use “version-” or “date-numbering” system
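As a sketch of re-structuring, the code below turns one-row-per-participant data into one-row-per-visit data and builds a file name that carries both a version and a date. The `_v1`, `_v2` column suffixes, the naming pattern, and both helper names are assumptions made for the example.

```python
from datetime import date


def wide_to_long(rows, stem, visits):
    """Turn columns such as sbp_v1, sbp_v2 into one row per visit."""
    return [
        {"id": r["id"], "visit": v, stem: r[f"{stem}_v{v}"]}
        for r in rows
        for v in visits
    ]


def dated_name(base, version, when):
    """Build a file name that records both version and date."""
    return f"{base}_v{version}_{when:%Y%m%d}.csv"


wide = [
    {"id": 1, "sbp_v1": 120, "sbp_v2": 118},
    {"id": 2, "sbp_v1": 135, "sbp_v2": 130},
]
long_rows = wide_to_long(wide, "sbp", visits=(1, 2))
file_name = dated_name("bpstudy_long", 2, date(2009, 11, 12))
```

Saving the re-structuring script alongside the dated output is one simple way to keep the record the slide asks for.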
Data Archiving
At the end of a project, the data, codebook, log-book and programs [syntax] must be archived
The archive serves as a permanent storage and gives access to all project-related information
Keep a copy of the archive and detailed report of the archive’s structure
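The report of the archive's structure could be generated rather than typed. The sketch below lists every file in an archive folder with its size and a SHA-256 checksum, so that a copy can later be verified against the original. The folder layout and file names in the usage lines are invented, and `build_manifest` is a hypothetical helper.

```python
import hashlib
import tempfile
from pathlib import Path


def build_manifest(archive_dir):
    """Return one entry per file: relative path, size in bytes, SHA-256."""
    root = Path(archive_dir)
    manifest = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            content = path.read_bytes()
            manifest.append({
                "path": path.relative_to(root).as_posix(),
                "bytes": len(content),
                "sha256": hashlib.sha256(content).hexdigest(),
            })
    return manifest


# Usage on a throwaway folder with hypothetical contents.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "codebook.txt").write_bytes(b"id: participant ID\n")
    (Path(tmp) / "syntax").mkdir()
    (Path(tmp) / "syntax" / "clean.py").write_bytes(b"# cleaning steps\n")
    manifest = build_manifest(tmp)
```

Reading each file fully into memory is fine for a sketch; very large data files would be better hashed in chunks.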