Lecture 4
• Ways to get data into SAS
• Some practice programming
• Review of statistical concepts
Getting data into SAS
• DATALINES statement: Data is contained within a data step
• INFILE statement: Data contained in separate file
• PROC IMPORT: Data contained in separate file
* List Directed Input: Reading data values separated by spaces.;
DATA bp;
  INFILE DATALINES;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C 84 138 93 143
D 89 150 91 140
A 78 116 100 162
A . . 86 155
C 81 145 86 140
;
RUN;

TITLE 'Data Separated by Spaces';
PROC PRINT DATA=bp;
RUN;

Obs    clinic    dbp6    sbp6    dbpbl    sbpbl
  1      C         84     138       93      143
  2      D         89     150       91      140
  3      A         78     116      100      162
  4      A          .       .       86      155
  5      C         81     145       86      140
* List Directed Input: Reading data values separated by commas;
DATA bp;
  INFILE DATALINES DLM = ',';
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,.,.,86,155
C,81,145,86,140
;
RUN;

TITLE 'Data separated by a comma';
PROC PRINT DATA=bp;
RUN;
* List Directed Input: Reading data values from a .csv type file;
DATA bp;
  INFILE DATALINES DLM = ',' DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140
;
RUN;

TITLE 'Reading in Data using the DSD Option';
PROC PRINT DATA=bp;
RUN;
* List Directed Input: Reading data values separated by tabs (.txt files);
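* The values below are separated by tab characters; the fourth row has two consecutive empty fields;
DATA bp;
  INFILE DATALINES DLM = '09'x DSD;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
  DATALINES;
C	84	138	93	143
D	89	150	91	140
A	78	116	100	162
A			86	155
C	81	145	86	140
;
RUN;

TITLE 'Reading in Data separated by a tab';
PROC PRINT DATA=bp;
RUN;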
* Reading data from an external file;
DATA bp;
  INFILE '/home/ph5415/data/bp.csv' DSD FIRSTOBS = 2;
  INPUT clinic $ dbp6 sbp6 dbpbl sbpbl;
RUN;

TITLE 'Reading in Data from an External File';
PROC PRINT DATA=bp;
RUN;

Content of bp.csv
clinic,dbp6,sbp6,dbpbl,sbpbl
C,84,138,93,143
D,89,150,91,140
A,78,116,100,162
A,,,86,155
C,81,145,86,140
* Using PROC IMPORT to read in data;
PROC IMPORT DATAFILE='/home/ph5415/data/bp.csv'
  OUT = bp
  DBMS = csv REPLACE;
  GETNAMES = yes;
RUN;

TITLE 'Reading in Data Using PROC IMPORT';
PROC PRINT DATA=bp;
RUN;
PROC CONTENTS DATA=bp;
RUN;
The CONTENTS Procedure
Data Set Name:  WORK.BP                              Observations:          5
Member Type:    DATA                                 Variables:             5
Engine:         V8                                   Indexes:               0
Created:        18:15 Tuesday, January 25, 2005      Observation Length:    40
Last Modified:  18:15 Tuesday, January 25, 2005      Deleted Observations:  0
Protection:                                          Compressed:            NO
Data Set Type:                                       Sorted:                NO
Label:
-----Alphabetic List of Variables and Attributes-----
#    Variable    Type    Len    Pos
-----------------------------------
1    clinic      Char      8     32
2    dbp6        Num       8      0
4    dbpbl       Num       8     16
3    sbp6        Num       8      8
5    sbpbl       Num       8     24
Some Definitions
• Statistics: The art and science of collecting, analyzing, presenting, and interpreting numerical data.
• Data: Facts and figures that are analyzed
• Dataset: All the data collected for a study
• Elements: Units on which data are collected
  – People, companies, schools, households
• Variables: Characteristics measured on elements
  – People (height, weight)
  – Company (number of employees)
  – Schools (percentage of students who graduate in 5 years)
  – Households (number of computers owned)
Informal Definition
• Statistics: A scientific way to gain information about something you do not know
Start With Research Question
• What is the proportion of persons without health insurance in Minnesota?
• Do newer blood pressure (BP) medications prevent heart disease better than older medications?
• What is the relationship between grade point average and SAT scores?
• Do persons who eat more fruits and vegetables (F&V) have a lower risk of developing colon cancer?
• Does the DARE program reduce the risk of young persons trying drugs?
Statistics
1) Start with question
2) Design study and collect data
3) Compute summary data to assess question
4) Make conclusions (inference)
Statistical Inference
• Estimation (Chapter 4)
• Hypothesis Testing (Chapter 5)
  – Comparing population proportions (Chapter 6)
  – Comparing population means (Chapter 7)
Common Parameters to Estimate
Parameter          Description
$\mu$              Mean of population
$p$                Proportion with a certain trait
$\rho$             Correlation between 2 variables
$\mu_1 - \mu_2$    Difference between 2 means
$p_1 - p_2$        Difference between 2 proportions
$\sigma$           Population standard deviation
Statistical Inference
1) Population with mean $\mu$ = ?
2) A simple random sample of n elements is selected from the population.
3) The sample data provide a value for the sample mean $\bar{x}$.
4) The value of $\bar{x}$ is used to make inferences about the value of $\mu$.
Sampling
• Sample: a subset of the target population
  (usually a simple random sample: each possible sample of a given size has equal probability of being selected)
• Different samples yield different estimates
• Trying to understand the population parameter (the “true value”)
  – It’s usually not possible to measure the population value
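The point that different samples yield different estimates can be sketched in SAS. This is only an illustration under stated assumptions: the dataset names (population, sample1, sample2), the variable chol, the seeds, and the simulated values (normal with mean 215 and standard deviation 20) are all hypothetical and are not part of the lecture data.

```sas
* Hypothetical population of 1000 simulated cholesterol values (assumed for illustration);
DATA population;
  CALL STREAMINIT(5415);
  DO id = 1 TO 1000;
    chol = 215 + 20*RAND('NORMAL');
    OUTPUT;
  END;
RUN;

* Draw two simple random samples of n = 100 using different seeds;
PROC SURVEYSELECT DATA=population OUT=sample1 METHOD=SRS SAMPSIZE=100 SEED=1;
RUN;
PROC SURVEYSELECT DATA=population OUT=sample2 METHOD=SRS SAMPSIZE=100 SEED=2;
RUN;

* The population mean is known here only because the population was simulated;
TITLE 'Population mean';
PROC MEANS DATA=population MEAN;
  VAR chol;
RUN;

TITLE 'Sample 1 mean';
PROC MEANS DATA=sample1 MEAN;
  VAR chol;
RUN;

TITLE 'Sample 2 mean';
PROC MEANS DATA=sample2 MEAN;
  VAR chol;
RUN;
```

The two sample means will generally differ from each other and from the population mean, which in a real study is not available for comparison.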
Point Estimate
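Parameter          Point Estimate
$\mu$              Sample mean $\bar{x}$
$p$                Sample proportion $\hat{p}$
$\rho$             Sample correlation $r$
$\mu_1 - \mu_2$    Difference between 2 sample means $\bar{x}_1 - \bar{x}_2$
$p_1 - p_2$        Difference between 2 sample proportions $\hat{p}_1 - \hat{p}_2$
$\sigma$           Sample standard deviation $s$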
Interval Estimation
In general, confidence intervals are of the form:
$\text{estimate} \pm 1.96 \times SE$
SE = standard error of your estimate
Estimate = mean, proportion, regression coefficient, odds ratio...
1.96 = multiplier for a 95% CI based on the normal distribution
Estimation
“What is the average total cholesterol level for MN residents?”
Random sample of cholesterol levels
sample mean = sum of values / number of observations
$\bar{X} = \frac{\sum X}{n}$
Estimates the population mean: $\mu$
Estimation
“What is the average total cholesterol level for MN residents?”
sample standard deviation:
$s = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
$s$ estimates the population standard deviation: $\sigma$
Confidence Interval Example
Suppose a sample of n = 100
mean = 215 mg/dL, standard deviation = 20 mg/dL
95% CI = $\bar{X} \pm 1.96 \, s/\sqrt{n}$
= (215 - 1.96*20/10, 215 + 1.96*20/10) = (211.08, 218.92), approximately (211, 219)
$s/\sqrt{n}$ = standard error of the mean
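The arithmetic in this example can be checked with a short DATA step. This is a minimal sketch; the dataset name ci and the variable names (n, xbar, s, se, lower, upper) are chosen here for illustration.

```sas
* 95% CI for the mean from summary statistics: n = 100, mean = 215, SD = 20;
DATA ci;
  n    = 100;
  xbar = 215;
  s    = 20;
  se    = s / SQRT(n);       * standard error of the mean = 20/10 = 2;
  lower = xbar - 1.96*se;    * 215 - 3.92 = 211.08;
  upper = xbar + 1.96*se;    * 215 + 3.92 = 218.92;
RUN;

TITLE '95% CI computed from summary statistics';
PROC PRINT DATA=ci;
RUN;
```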
Properties of Confidence Intervals
• As sample size increases, CI gets narrower
  – If you could sample the whole population, $\bar{X} = \mu$
• Can use different levels of confidence
  – 90, 95, 99% common
  – More confidence means a wider interval, so a 90% CI is narrower than a 99% CI
• Changes with population standard deviation
  – More variable population means a wider interval
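These properties can be seen by tabulating the interval width for a few settings. The sketch below assumes a normal-based interval and reuses the standard deviation of 20 from the cholesterol example; the sample sizes 100 and 400 and all dataset and variable names are chosen here for illustration.

```sas
* Width of a normal-based CI for several sample sizes and confidence levels;
DATA ci_width;
  s = 20;                                * standard deviation from the cholesterol example;
  DO n = 100, 400;
    DO level = 0.90, 0.95, 0.99;
      z = PROBIT(1 - (1 - level)/2);     * normal multiplier: about 1.645, 1.960, 2.576;
      width = 2 * z * s / SQRT(n);       * full width of the interval;
      OUTPUT;
    END;
  END;
RUN;

TITLE 'CI width by sample size and confidence level';
PROC PRINT DATA=ci_width;
RUN;
```

Quadrupling n from 100 to 400 halves the width at each confidence level, because the standard error depends on the square root of n.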
Caution with Confidence Intervals
– Data should be from random sample
– More complicated sampling requires different methods
  • Example: multistage or stratified sampling
– Outliers can cause problems
– Non-normal data can change confidence level
  • Skewed data a big problem
– Bias not accounted for
  • Non-responders
  • Target and sampled population different
95% Confidence Intervals with SAS
1) Construct from output
estimate +/- 1.96*SE
2) Provided automatically by some procedures
PROC MEANS DATA = STUDENTS LCLM UCLM;
  VAR AGE;
RUN;
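A self-contained version of the PROC MEANS approach might look like the sketch below. The ten age values are made up for illustration and are not the course's STUDENTS data. Note that PROC MEANS bases LCLM and UCLM on the t distribution rather than the fixed multiplier 1.96, so for small samples its limits are somewhat wider than estimate +/- 1.96*SE.

```sas
* Hypothetical data: ages of 10 students (values assumed for illustration);
DATA students;
  INPUT age @@;
  DATALINES;
22 25 31 24 28 23 35 27 26 29
;
RUN;

* N, mean, standard error, and 95% confidence limits for the mean;
TITLE '95% CI for mean age from PROC MEANS';
PROC MEANS DATA=students N MEAN STDERR LCLM UCLM ALPHA=0.05;
  VAR age;
RUN;
```

ALPHA=0.05 is the default and is written out only to show where the confidence level is set; ALPHA=0.10 would request 90% limits.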