Today: Feb 28
• Reading Data from existing SAS dataset
• One-way ANOVA
• Reading Le 7:5
• Reading C&S 7:A-H
Reading SAS Datasets
LIBNAME tomhs 'c:/my documents/ph5415/';
PROC CONTENTS DATA=tomhs.bpstudy;
PROC PRINT DATA=tomhs.bpstudy (obs=10);
RUN;
The libname statement tells SAS which directory (folder) the dataset is in.
DATA=tomhs.bpstudy tells SAS to look for a SAS dataset called bpstudy in the directory referenced by tomhs.
Sometimes your “raw” data is already a SAS dataset
PROC CONTENTS OUTPUT
The CONTENTS Procedure

Data Set Name: TOMHS.BPSTUDY                          Observations:          902
Member Type:   DATA                                   Variables:             16
Engine:        V8                                     Indexes:               0
Created:       9:07 Saturday, February 26, 2005       Observation Length:    128
Last Modified: 9:07 Saturday, February 26, 2005       Deleted Observations:  0

-----Alphabetic List of Variables and Attributes-----

 #    Variable    Type    Len    Pos
------------------------------------
 3    AGE         Num       8     16
 6    CHOL12      Num       8     40
 2    GROUP       Num       8      8
 8    HDL12       Num       8     56
 9    PULSE12     Num       8     64
10    PULSEBL     Num       8     72
 4    SBP12       Num       8     24
 5    SBPBL       Num       8     32
 1    SEX         Num       8      0
 7    TRIG12      Num       8     48
11    WT12        Num       8     80
12    WTBL        Num       8     88
13    cholbl      Num       8     96
14    hdlbl       Num       8    104
16    id          Char      6    120
15    trigbl      Num       8    112
PROC PRINT – 10 Observations
Obs SEX GROUP AGE SBP12 SBPBL CHOL12 TRIG12 HDL12 PULSE12 PULSEBL WT12 WTBL cholbl hdlbl trigbl id
1 1 3 54 . 139.5 . . . . 76 . 224.0 205 24 179 A00001
2 2 6 62 129 144.0 241 65 66 80 72 124.0 141.0 260 75 67 A00010
3 2 5 64 118 141.0 307 425 41 80 81 144.0 157.0 228 29 564 A00021
4 1 5 47 . 134.0 . . . . 80 . 214.0 194 66 49 A00023
5 1 3 51 . 132.5 . . . . 73 . 206.5 226 40 53 A00056
6 1 2 62 133 133.0 196 72 44 72 76 211.0 227.5 207 47 126 A00075
7 2 2 59 113 136.0 231 75 61 72 74 125.0 137.0 214 62 119 A00083
8 1 3 63 127 137.5 217 137 35 64 74 195.0 211.5 214 37 165 A00105
9 2 4 64 122 151.0 201 57 44 56 63 150.0 159.5 214 47 133 A00133
10 2 5 52 122 140.0 209 105 57 60 81 168.5 196.5 215 55 105 A00143
Reading a SAS Dataset
DATA temp;
  SET tomhs.bpstudy;
  sbpdif = sbp12-sbpbl;
PROC MEANS DATA=temp;
RUN;
The MEANS Procedure
Variable      N           Mean        Std Dev        Minimum        Maximum
---------------------------------------------------------------------------
SEX         902      1.3824834      0.4862633      1.0000000      2.0000000
GROUP       902      3.7882483      1.7874130      1.0000000      6.0000000
AGE         902     54.7727273      6.4039396     44.0000000     69.0000000
SBP12       848    124.1002358     15.1891840     87.0000000    187.0000000
SBPBL       902    140.3636364     12.4446043    113.5000000    190.0000000
CHOL12      849    220.8386337     38.8624342    111.0000000    456.0000000
TRIG12      849    106.9634865     62.5307082     24.0000000    592.0000000
HDL12       849     45.4923439     12.1059688     18.0000000    102.0000000
PULSE12     847     69.3506494     10.0301471     44.0000000    112.0000000
PULSEBL     901     73.6925638      8.6698610     48.0000000    109.0000000
WT12        848    176.8225236     30.4251368    105.5000000    286.0000000
WTBL        902    187.3791574     31.0782720    113.0000000    289.2500000
cholbl      900    228.2511111     38.4169684    113.0000000    357.0000000
hdlbl       900     43.6122222     11.6124701     17.0000000     97.0000000
trigbl      900    131.7366667     76.5211232     17.0000000    815.0000000
sbpdif      848    -16.5176887     14.4532685    -75.5000000     30.0000000
The SET statement reads in an observation from an existing SAS dataset. It replaces the INFILE and INPUT statements used when reading in text data.
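As a rough sketch of how SET compares with INFILE and INPUT, the two DATA steps below build a similar working dataset in each style. The dataset names temp2 and fromtext, the text file bpstudy.txt, and its column layout are assumptions made for illustration; only tomhs.bpstudy and its variables come from these notes.

```sas
/* Reading from an existing SAS dataset: variable names and types are already stored */
DATA temp2;                                     /* temp2 is a hypothetical name */
  SET tomhs.bpstudy (KEEP=id group sbpbl sbp12);
  WHERE sbp12 IS NOT MISSING;                   /* drop participants with no 12-month SBP */
  sbpdif = sbp12 - sbpbl;
RUN;

/* Reading from a text file: you must describe the layout yourself */
DATA fromtext;                                  /* hypothetical name */
  INFILE 'c:/my documents/ph5415/bpstudy.txt';  /* hypothetical file, assumed space-delimited */
  INPUT id $ group sbpbl sbp12;
  sbpdif = sbp12 - sbpbl;
RUN;
```

Since SBP12 is non-missing for 848 of the 902 participants in the PROC MEANS output, the first step should keep 848 observations.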
One-Way Analysis of Variance
• Two-sample t-test: compare means of two groups
  – Are the means different?
• What if we have more than two groups?
Examples:
• compare three different behavioral interventions
• compare 5 different BP drugs
Analysis of Variance
Could compare all pairs of means with t-tests
three groups: A-B, B-C, A-C
five groups:
A-B, A-C, A-D, A-E
B-C, B-D, B-E
C-D, C-E
D-E
Analysis of Variance
Problem - multiple comparisons!!
When performing many tests, may reject null hypothesis by chance (Type I error)
With $\alpha$ = 0.05, you allow for the possibility of rejecting 1 out of 20 true null hypotheses by chance
Even if all group means are equal, there is a fairly large chance that at least one pair will test as different
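A back-of-the-envelope calculation shows how quickly the chance of a false rejection grows with the number of groups. It assumes the m pairwise tests are independent, which tests that share groups are not, so the figures are approximate.

```latex
\begin{aligned}
m &= \binom{k}{2} = \frac{k(k-1)}{2}, \qquad
P(\text{at least one Type I error}) \approx 1 - (1-\alpha)^{m} \\
k = 3:&\quad m = 3, \quad 1 - 0.95^{3} \approx 0.14 \\
k = 5:&\quad m = 10, \quad 1 - 0.95^{10} \approx 0.40
\end{aligned}
```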
Analysis of Variance
ANOVA simultaneously tests for difference in k means
• Y - continuous
• k samples from k normal distributions
  • each of size $n_i$, not necessarily equal
  • each with possibly different mean $\mu_i$
  • each with constant variance $\sigma^2$
Constant variance
ANOVA is robust for violations of constant variance (and normality)
Rule of thumb:
If largest standard deviation is less than twice the smallest standard deviation, you’re ok.
Can sometimes transform to achieve equal variance or normality
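One way to check the rule of thumb is to print the standard deviation of the outcome within each group. This is a minimal sketch that assumes the dataset temp and the variables group and sbpdif created in the DATA step earlier in these notes; substitute your own names.

```sas
/* Standard deviation of the outcome within each level of the grouping variable */
PROC MEANS DATA=temp N MEAN STD;
  CLASS group;
  VAR sbpdif;
RUN;
```

Compare the largest and smallest values in the Std Dev column. If the largest is less than twice the smallest, the rule of thumb is met.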
Analysis of Variance
$H_0: \mu_1 = \mu_2 = \dots = \mu_k$
$H_a$: Not all $\mu_i$ equal
For each group i:
$n_i$ = number of observations
$\bar{Y}_i$ = sample mean
$s_i^2$ = sample variance
$\bar{Y}$ = overall mean
Two-sample t-test is special case; k = 2
Sometimes referred to as a global or omnibus test
Two-sample T-test
• Compared means for two groups
• This compares variation between groups with variation within groups

$$ t = \frac{\bar{y}_1 - \bar{y}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} $$

Numerator: Variation Between Groups
Denominator: Variation Within Groups
ANOVA F-test
• Compares means for all groups
• This compares variation between groups with variation within groups

$$ F = \frac{\sum_{i=1}^{k} n_i (\bar{Y}_i - \bar{Y})^2 / (k-1)}{s_p^2} $$

Numerator: Variation Between Groups, compared to the grand mean
Denominator: Variation Within Groups
Analysis of Variance
Variation for all observations: $\sum_{i}\sum_{j} (Y_{ij} - \bar{Y})^2$
Called the “(corrected) total sum of squares” or SST
Can be divided into two parts:
• deviation of individual observation from its sample mean
• deviation of sample means from overall mean

$$ (Y_{ij} - \bar{Y}) = (Y_{ij} - \bar{Y}_i) + (\bar{Y}_i - \bar{Y}) $$

Similar to regression
Analysis of Variance
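$(Y_{ij} - \bar{Y}_i)$ measures variation within samples
$(\bar{Y}_i - \bar{Y})$ measures variation between samples
Each has a corresponding “sum of squares”
Sum of squares within (SSW): $\sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_i)^2$
Sum of squares between (SSB): $\sum_{i}\sum_{j} (\bar{Y}_i - \bar{Y})^2 = \sum_{i} n_i (\bar{Y}_i - \bar{Y})^2$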
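To see why the two pieces add up to SST, square the decomposition of each deviation and sum over all observations. The cross term drops out because deviations from a group mean sum to zero within each group. A short derivation in the same notation:

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij}-\bar{Y})^2
     = \sum_{i}\sum_{j} \bigl[(Y_{ij}-\bar{Y}_i) + (\bar{Y}_i-\bar{Y})\bigr]^2 \\
    &= \sum_{i}\sum_{j} (Y_{ij}-\bar{Y}_i)^2
     + \sum_{i}\sum_{j} (\bar{Y}_i-\bar{Y})^2
     + 2\sum_{i} (\bar{Y}_i-\bar{Y}) \underbrace{\sum_{j} (Y_{ij}-\bar{Y}_i)}_{=\,0} \\
    &= SSW + SSB
\end{aligned}
```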
Analysis of Variance
Each has a corresponding degrees of freedom (DF)
SST = n-1 df
SSB = k-1 df
SSW = (n-1) - (k-1) = n-k df
Ratio of each sum of squares over its degrees of freedom gives us the mean squares
MSW = SSW / (n-k) = average variation within k samples
MSB = SSB / (k-1) = average variation between k samples
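Analysis of Variance
MSW is an estimate of the constant within-group variance, $\sigma^2$

MSW = SSW/(n-k)

$$ SSW = \sum_{i=1}^{k}\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2 $$

Sample variance for ith group:

$$ s_i^2 = \frac{\sum_{j=1}^{n_i} (Y_{ij} - \bar{Y}_i)^2}{n_i - 1} $$

$$ SSW = \sum_{i}\sum_{j} (Y_{ij} - \bar{Y}_i)^2 = \sum_{i} (n_i - 1)\, s_i^2 $$

$$ MSW = \frac{\sum_{i} (n_i - 1)\, s_i^2}{\sum_{i} (n_i - 1)} = \text{pooled variance for k groups} $$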
Analysis of Variance
The null hypothesis is tested by looking at F ratio:
F = MSB/MSW, compare to F distribution with k-1, n-k df
If variation between groups is much greater than variation within groups:
F >> 1, reject null hypothesis
F ≈ 1, fail to reject null hypothesis
Analysis of Variance
Results often presented in an ANOVA table
Source SS df MS F p-value
Between SSB k-1 MSB MSB/MSW p
Within SSW n-k MSW
Total SST n-1
SAS uses “Model” for “Between” and “Error” for “Within”
ANOVA in SAS; two ways
PROC ANOVA DATA = LIPID;
  CLASS diet;
  MODEL lipid = diet;
RUN;

PROC GLM DATA = LIPID;
  CLASS diet;
  MODEL lipid = diet;
RUN;

Both test for a difference in mean lipid reduction for the two diets
PROC ANOVA and GLM
• Almost exactly the same for this case
• GLM is a more general procedure
TOMHS Study
• 6 Treatment groups (Variable GROUP)
  – Beta-blocker
  – Calcium channel blocker
  – Diuretic
  – Alpha-blocker
  – ACE inhibitor
  – Placebo
  – All treatments given lifestyle intervention to lower BP
ANOVA – TOMHS Study
PROC GLM DATA=temp;
  CLASS group;
  MODEL sbpdif = group;
  MEANS group;
RUN;

OUTPUT
The GLM Procedure
Class Level Information
Class Levels Values
GROUP 6 1 2 3 4 5 6
Number of observations 902
NOTE: Due to missing values, only 848 observations can be used in this analysis
Creates 5 dummy variables for you
GLM – OUTPUT
The GLM Procedure
Dependent Variable: sbpdif
Source DF Sum of Squares Mean Square F Value Pr > F
Model 5 13149.8402 2629.9680 13.52 <.0001
Error 842 163785.8945 194.5201
Corrected Total 847 176935.7347
R-Square Coeff Var Root MSE sbpdif Mean
0.074320 -84.43703 13.94705 -16.51769
ANOVA TABLE
If H0 is true, then F should be near 1
F = 2629.97/194.52 = 13.52
Root MSE estimates the pooled (over 6 groups) standard deviation
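The F value and its p-value can be reproduced by hand from the two mean squares in the table. The sketch below uses the SAS functions PROBF (cumulative F distribution) and FINV (F quantile); the variable names are arbitrary, and the degrees of freedom are the 5 and 842 shown in the output.

```sas
DATA _NULL_;
  msb  = 2629.9680;                 /* Model (between) mean square from the output */
  msw  = 194.5201;                  /* Error (within) mean square from the output  */
  f    = msb / msw;                 /* F ratio, about 13.52                        */
  p    = 1 - PROBF(f, 5, 842);      /* upper-tail area with k-1 = 5, n-k = 842 df  */
  crit = FINV(0.95, 5, 842);        /* value F must exceed to reject at the 5% level */
  PUT f= p= crit=;                  /* writes the three values to the SAS log      */
RUN;
```

The p-value written to the log should be far below 0.0001, in line with the `<.0001` that PROC GLM prints.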
GLM – OUTPUT
Source DF Type I SS Mean Square F Value Pr > F
GROUP 5 13149.84018 2629.96804 13.52 <.0001
Source DF Type III SS Mean Square F Value Pr > F
GROUP 5 13149.84018 2629.96804 13.52 <.0001
If no covariates are in the model this portion of the output will be the same as the ANOVA table because the model includes only GROUP.
The GLM Procedure
Level of            ------------sbpdif-----------
GROUP        N             Mean          Std Dev

1          126      -20.0555556       15.3474717
2          121      -17.5289256       11.6080607
3          124      -21.8467742       14.4977118
4          129      -16.0697674       14.0005223
5          127      -17.6023622       13.1844874
6          221      -10.5950226       14.3539675
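These group summaries are enough to rebuild the within-group sum of squares as the sum of (n_i - 1) s_i^2 over the groups. The sketch below types the six rows in by hand; the dataset name groupstats and the variable names are made up for illustration.

```sas
DATA groupstats;                    /* hypothetical name */
  INPUT group n mean sd;
  ss_within = (n - 1) * sd**2;      /* (n_i - 1) * s_i^2 for this group */
  df_within = n - 1;                /* this group's share of the n-k df */
  DATALINES;
1 126 -20.0555556 15.3474717
2 121 -17.5289256 11.6080607
3 124 -21.8467742 14.4977118
4 129 -16.0697674 14.0005223
5 127 -17.6023622 13.1844874
6 221 -10.5950226 14.3539675
;
RUN;

/* Column totals: sum of ss_within is SSW, sum of df_within is n-k */
PROC MEANS DATA=groupstats SUM;
  VAR ss_within df_within;
RUN;
```

The two sums should come to roughly 163,786 and 842, matching the Error row of the ANOVA table. Their ratio is the pooled variance, about 194.52, whose square root is the Root MSE.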
Contrasts
PROC GLM DATA=temp;
  CLASS group;
  MODEL sbpdif = group;
  MEANS group;
  ESTIMATE 'BB vs Placebo'   group 1 0 0 0 0 -1;
  ESTIMATE 'CCB vs Placebo'  group 0 1 0 0 0 -1;
  ESTIMATE 'Diur vs Placebo' group 0 0 1 0 0 -1;
  ESTIMATE 'AB vs Placebo'   group 0 0 0 1 0 -1;
  ESTIMATE 'ACE vs Placebo'  group 0 0 0 0 1 -1;
RUN;

OUTPUT
The GLM Procedure

Dependent Variable: sbpdif

Parameter              Estimate    Standard Error    t Value    Pr > |t|

BB vs Placebo        -9.4605329        1.55691725      -6.08      <.0001
CCB vs Placebo       -6.9339030        1.57727142      -4.40      <.0001
Diur vs Placebo     -11.2517516        1.56489344      -7.19      <.0001
AB vs Placebo        -5.4747448        1.54534422      -3.54      0.0004
ACE vs Placebo       -7.0073396        1.55300848      -4.51      <.0001
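Each row can be checked by hand from the group means, the group sizes, and the Error mean square (MSW = 194.5201). As a worked example for the first contrast, beta-blocker (group 1) versus placebo (group 6):

```latex
\begin{aligned}
\text{Estimate} &= \bar{Y}_1 - \bar{Y}_6 = -20.0556 - (-10.5950) = -9.4605 \\
\text{SE} &= \sqrt{MSW\left(\frac{1}{n_1} + \frac{1}{n_6}\right)}
           = \sqrt{194.5201\left(\frac{1}{126} + \frac{1}{221}\right)} \approx 1.557 \\
t &= \frac{-9.4605}{1.557} \approx -6.08 \quad \text{on } n-k = 842 \text{ df}
\end{aligned}
```

The standard error uses the variance pooled over all six groups, not just the two groups being compared.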
Compare all Groups
PROC GLM DATA=temp;
  CLASS group;
  MODEL sbpdif = group;
  LSMEANS group/PDIF;
RUN;

GLM – OUTPUT
The GLM Procedure
Least Squares Means

                               LSMEAN
GROUP      sbpdif LSMEAN       Number

1            -20.0555556            1
2            -17.5289256            2
3            -21.8467742            3
4            -16.0697674            4
5            -17.6023622            5
6            -10.5950226            6
Least Squares Means for effect GROUP Pr > |t| for H0: LSMean(i)=LSMean(j)
Dependent Variable: sbpdif
i/j         1         2         3         4         5         6

1                0.1550    0.3103    0.0228    0.1622    <.0001
2      0.1550              0.0156    0.4087    0.9669    <.0001
3      0.3103    0.0156              0.0010    0.0161    <.0001
4      0.0228    0.4087    0.0010              0.3796    0.0004
5      0.1622    0.9669    0.0161    0.3796              <.0001
6      <.0001    <.0001    <.0001    0.0004    <.0001
NOTE: To ensure overall protection level, only probabilities associated with pre-planned comparisons should be used.
P-value for Group 1 vs Group 2: 0.1550 (row 1, column 2)
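The p-values in this matrix are not adjusted for multiple comparisons. If all 15 pairwise comparisons are of interest rather than a few pre-planned ones, one common option is to request a multiplicity adjustment on the LSMEANS statement. This goes beyond what these notes cover, so treat it as a suggestion: ADJUST= is a documented LSMEANS option in PROC GLM, but the choice of Tukey here is an assumption, not something the lecture prescribes.

```sas
PROC GLM DATA=temp;
  CLASS group;
  MODEL sbpdif = group;
  /* PDIFF prints the matrix of pairwise p-values;        */
  /* ADJUST=TUKEY adjusts them for all pairwise contrasts */
  LSMEANS group / PDIFF ADJUST=TUKEY;
RUN;
```

The adjusted p-values will be no smaller than the unadjusted ones, so borderline pairs such as group 1 versus group 4 (0.0228 unadjusted) may no longer be significant at the 5% level.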