SAS PROGRAMMING: ANALYTICAL
Eng. Mohammad KHALAF
Mobile: 00962-79-5880413
Email: [email protected]
Webpage: www.statanalysis.weebly.com
Mohammad KHALAF- [email protected] www.statanalysis.weebly.com page 2
TABLE OF CONTENTS
Graphics
Univariate
Correlation
Kernel Density Estimate
T-test
Analysis of each variable separately
T-test
   One sample t test
   Paired t-test
Correlation Test
   Independent sample t-test
ANOVA
Regression analysis
ODS in SAS
Appendices
   Questionnaire
   SAS t-test Commands
   SAS Simple Linear Regression Example
GRAPHICS
To produce a simple scatterplot of two variables, use proc gplot as follows:
data graph;
input x y;
datalines;
20 10
15 23
5 14
;
run;
proc print data=graph;
run;
proc gplot;
plot y * x;
run;
The listing from proc print appears in the Output window, and the scatterplot is displayed in the Graph window.
To join the points with a line, add the following symbol statement:
symbol1 i=join;
proc gplot;
plot y * x;
run;
where i stands for interpolation.
Further options can be added to the graph:
data graph;
input x y;
datalines;
20 10
15 23
5 14
;
run;
proc print data=graph;
run;
/* A later symbol1 statement overrides options set by an earlier one */
symbol1 v=none i=join;
symbol1 v=square i=join;
symbol2 v=circle i=join;
proc gplot;
plot y * x;
run;
where v (value) sets the plotting symbol. A third variable can be used to draw a separate line for each group:
data graph;
input x y sex $;
datalines;
20 10 M
15 23 F
5 14 M
;
run;
symbol1 v=none i=join c=red;
symbol2 v=none i=join c=red;
proc gplot;
plot y * x = sex;
run;
The plot can be repeated with a diamond symbol for the first group:
symbol1 v=diamond i=join c=red;
symbol2 v=none i=join c=red;
proc gplot;
plot y * x = sex;
run;
UNIVARIATE
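data water;
length flag $ 1 Town $ 16;
input Town $ Mortal Hardness;
/* A leading asterisk is the flag; move it out of the town name */
if substr(Town, 1, 1) = '*' then do;
   flag = '*';
   Town = substr(Town, 2);
end;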
datalines;
Bath 1247 105
*Birkenhead 1668 17
Birmingham 1466 5
*Blackburn 1800 14
*Blackpool 1609 18
*Bolton 1558 10
*Bootle 1807 15
Bournemouth 1299 78
*Bradford 1637 10
Brighton 1359 84
Bristol 1392 73
*Burnley 1755 12
Cardiff 1519 21
Coventry 1307 78
Croydon 1254 96
*Darlington 1491 20
*Derby 1555 39
*Doncaster 1428 39
EastHam 1318 122
Exeter 1260 21
*Gateshead 1723 44
*Grimsby 1379 94
*Halifax 1742 8
*Huddersfield 1574 9
*Hull 1569 91
Ipswich 1096 138
*Leeds 1591 16
Leicester 1402 37
*Liverpool 1772 15
*Manchester 1828 8
*Middlesbrough 1704 26
*Newcastle 1702 44
Newport 1581 14
Northampton 1309 59
Norwich 1259 133
*Nottingham 1427 27
*Oldham 1724 6
Oxford 1175 107
Plymouth 1486 5
Portsmouth 1456 90
*Preston 1696 6
Reading 1236 101
*Rochdale 1711 13
*Rotherham 1444 14
*StHelens 1591 49
*Salford 1987 8
*Sheffield 1495 14
Southampton 1369 68
Southend 1257 50
*Southport 1587 75
*SouthShields 1713 71
*Stockport 1557 13
*Stoke 1640 57
*Sunderland 1709 71
Swansea 1625 13
*Wallasey 1625 20
Walsall 1527 60
WestBromwich 1627 53
WestHam 1486 122
Wolverhampton 1485 81
*York 1378 71
;
run;
proc print data=water;
run;
proc univariate data=water normal;
var mortal hardness;
histogram mortal hardness /normal;
probplot mortal hardness;
run;
The meanings of some of the other statistics printed in these displays are as follows:
Abbreviation            Meaning
Uncorrected SS          Uncorrected sum of squares; simply the sum of squares of the observations
Corrected SS            Corrected sum of squares; simply the sum of squares of deviations of the observations from the sample mean
Coeff Variation         Coefficient of variation; the standard deviation divided by the mean and multiplied by 100
Std Error Mean          Standard deviation divided by the square root of the number of observations
Range                   Difference between largest and smallest observation in the sample
Interquartile Range     Difference between the 25% and 75% quantiles (see values of quantiles given later in display to confirm)
Student's t             Student's t-test value for testing that the population mean is zero
Pr>|t|                  Probability of a greater absolute value for Student's t under the hypothesis that the population mean is zero
Sign Test               Nonparametric test statistic for testing whether the population median is zero
Pr>|M|                  Approximation to the probability of a greater absolute value for the Sign test under the hypothesis that the population median is zero
Signed Rank             Nonparametric test statistic for testing whether the population mean is zero
Pr>=|S|                 Approximation to the probability of a greater absolute value for the Signed Rank statistic under the hypothesis that the population mean is zero
Shapiro-Wilk W          Shapiro-Wilk statistic for assessing the normality of the data and the corresponding P-value (Shapiro and Wilk [1965])
Kolmogorov-Smirnov D    Kolmogorov-Smirnov statistic for assessing the normality of the data and the corresponding P-value (Fisher and Van Belle [1993])
Cramer-von Mises W-sq   Cramer-von Mises statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
Anderson-Darling A-sq   Anderson-Darling statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
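Several of the statistics in this table can also be requested by name from proc means, which gives a convenient cross-check on the proc univariate listing. The following is a minimal sketch, assuming the water data set created above is still available in the current session; the statistic keywords are standard proc means keywords.

```sas
/* Cross-check a few of the statistics reported by proc univariate */
proc means data=water n mean std uss css cv stderr range qrange maxdec=2;
   var mortal hardness;
run;
```

Here uss and css are the uncorrected and corrected sums of squares, cv is the coefficient of variation, stderr is the standard error of the mean, and qrange is the interquartile range.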
A scatterplot of mortality against hardness is produced with:
proc gplot data=water;
plot mortal*hardness;
run;
CORRELATION
proc corr data=water pearson spearman;
var mortal hardness;
run;
KERNEL DENSITY ESTIMATE
proc kde data=water out=bivest;
var mortal hardness;
proc g3d data=bivest;
plot hardness*mortal=density;
run;
where KDE stands for kernel density estimate.
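The var statement and the out= option on the proc kde statement follow the syntax of older SAS releases. More recent releases (SAS 9 and later) expect a bivar statement, and the output data set stores the grid coordinates in variables named value1 and value2 rather than under the original variable names. The following sketch assumes that newer syntax and that SAS/GRAPH is available for proc g3d; it is worth confirming the variable names with proc contents before plotting.

```sas
proc kde data=water;
   bivar mortal hardness / out=bivest;
run;

proc contents data=bivest;   /* confirm the names of the grid variables */
run;

proc g3d data=bivest;
   plot value2*value1=density;   /* value1 = mortal, value2 = hardness */
run;
quit;
```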
T-TEST
data water;
set water;
lhardnes=log(hardness);
if hardness < 100 then T = 1;
else T=2;
proc ttest;
class T;
var mortal hardness lhardnes;
proc npar1way wilcoxon;
class T;
var hardness;
run;
Example for application
The questionnaire that is the source of these data is given in the appendices.
The following data are part of the real data collected through the questionnaire.
data sasuser.book3;
input ser p1 p2 p3 p4 p5 q1 q2 q3 q4 q5
q6 q7 q8 q9 q10 q11 q12 q13 q14;
datalines;
1 2 2 2 3 3 4 4 2 2 4 3
3 3 3 4 4 3 3 3
2 1 2 2 2 1 4 5 4 4 3 2
1 5 3 4 4 4 3 1
3 1 2 1 3 3 4 4 4 5 4 4
4 4 4 4 4 4 4 4
4 1 3 1 3 4 5 5 5 5 5 5
5 5 5 5 5 5 5 5
5 1 2 2 2 3 4 4 4 4 4 4
4 4 4 5 4 4 4 5
6 1 2 2 2 1 2 2 1 2 3 2
1 2 1 2 3 2 3 2
7 1 2 2 3 1 3 2 2 3 3 2
3 2 3 3 3 3 2 3
8 1 1 1 3 1 2 2 1 1 2 2
2 2 2 1 1 2 1 1
9 2 3 2 3 4 1 2 2 2 1 2
1 2 2 2 1 1 2 3
10 1 3 2 3 1 4 3 3 2 4 3
2 5 5 2 4 3 5 4
11 1 3 2 3 1 5 5 4 3 4 4
4 4 5 5 5 4 4 4
12 1 3 2 3 1 5 4 4 4 4 4
4 4 4 3 3 3 2 2
13 1 3 2 3 2 4 4 4 4 3 3
3 4 4 4 4 4 4 4
14 2 3 1 2 4 3 4 4 3 2 4
1 5 3 2 3 3 3 2
15 2 2 2 1 2 4 4 4 3 3 2
3 3 3 3 3 4 3 4
16 2 2 1 1 2 4 2 5 3 2 2
3 5 3 4 4 1 3 2
17 2 2 2 2 2 4 3 3 2 3 3
4 4 4 4 5 5 5 3
18 2 2 4 2 1 2 1 4 1 2 2
2 1 2 2 3 3 2 1
19 2 3 2 2 4 4 4 4 4 3 4
3 4 4 5 4 4 4 5
20 2 3 1 2 4 4 4 4 3 4 3
2 4 3 4 4 4 4 3
;
run;
Alternatively, the data can be imported from an Excel file:
proc import out=sasuser.book3
datafile="C:\Users\Mohd KHALAF\Desktop\SAS_Training\samples\book3.xls"
DBMS= Excel Replace;
GETNAMES=YES;
Run;
To see the file contents use the following procedure:
proc contents data=sasuser.book3;
run;
The output will be as follows, giving complete information about the data set:
ANALYSIS OF EACH VARIABLE SEPARATELY
To analyze the previous data, we first describe each of the demographic variables
on its own. The second stage describes the remaining questions, choosing a suitable
analysis to identify the trends in the sample for each item. We start by finding the
frequencies and percentages for each demographic question, then cross-tabulate the
demographic variables against one another and test whether the resulting
distributions are significant.
To obtain frequencies for all variables in the data set, use the following
command:
proc freq data=sasuser.book3;
run;
Output will be:
However, the variables in the second part of the questionnaire are better analyzed
with other types of tests, and it makes no sense to compute frequencies, or any other
analysis, for the serial number variable (ser). The frequencies are therefore computed
for p1 to p5 only, using the following procedure:
proc freq data=sasuser.book3;
tables p1 p2 p3 p4 p5;
run;
The output will be as follows:
From the previous tables, each of the five demographic variables can be described
separately.
To make the output more readable, variable labels and value labels should be
added. Variable labels can be added with the following program:
data sasuser.book3;
set sasuser.book3;
label p1= "Sex"
p2 = "Age"
p3= "Educational level"
p4= "Administrative level"
p5= "Years of experience";
run;
proc contents data=sasuser.book3;
run;
The output of the contents procedure will be as follows:
The frequencies for p1 and p2 can then be obtained with the following
procedure:
proc freq data=sasuser.book3;
tables p1 p2;
run;
The results will be as follows, showing the labels:
To add value labels for variables, proc format can be used as follows:
proc format;
value p1f 1="Male"
2="Female";
value p2f 1="Less than 25 years"
2="25 to less than 35"
3="35 to less than 45"
4="45 to less than 55"
5="55 and over";
value p3f 1="Intermediate diploma or less"
2="Bachelor's degree"
3="Higher diploma"
4="Master's degree"
5="Doctorate";
value p4f 1="Top management"
2="Middle management"
3="Supervisory management";
value p5f 1="Less than 5 years"
2="5 to less than 10 years"
3="10 to less than 15 years"
4="15 years and over";
run;
proc freq data=sasuser.book3;
format p1 p1f.
p2 p2f.
p3 p3f.
p4 p4f.
p5 p5f.;
tables p1 p2 p3 p4 p5;
run;
The output will be as follows:
For more information about the demographic features of the sample, cross-tabulation
can be used. With the value formats p1f to p5f defined above, the procedure is as
follows:
proc freq data=sasuser.book3;
format p1 p1f.
p2 p2f.
p3 p3f.
p4 p4f.
p5 p5f.;
tables p1*p2 p1*p3 p1*p4 p1*p5/chisq;
run;
The output for this analysis is as follows:
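With only 20 respondents, many cells of these tables will have small expected counts, and the chi-square approximation may then be unreliable (SAS prints a warning under the chi-square table when this happens). A minimal sketch of one common remedy, assuming the formats p1f and p3f defined above: request the expected counts and Fisher's exact test alongside the chi-square test.

```sas
proc freq data=sasuser.book3;
   format p1 p1f. p3 p3f.;
   /* expected = expected cell counts, fisher = Fisher's exact test */
   tables p1*p3 / chisq expected fisher nocol nopercent;
run;
```

The same tables statement can be repeated for p1*p2, p1*p4 and p1*p5.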
To analyze the second part of the questionnaire, descriptive statistics are used,
concentrating on the mean and standard deviation of questions q1-q14. The
following procedure can be used:
proc means data=sasuser.book3;
run;
This gives the means for all variables in the data set, as follows:
Since it makes no sense to include the first part, which has already been analyzed,
the following procedure gives the means for the second part only:
proc means data=sasuser.book3;
var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14_;
run;
The output will be:
To limit the output to the mean or other selected statistics, list them on the
proc means statement as follows:
proc means N mean std data=sasuser.book3;
var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14_;
run;
Suppose it is noticed that the variable q14 was wrongly named q14_. The
variable q14_ can be renamed to q14 using the following procedure:
data sasuser.book4;
set sasuser.book3;
rename q14_=q14;
run;
Output with many unnecessary decimal places is hard to read. To limit the number
of decimal places, use the maxdec= option (here maxdec=2) on the proc means
statement as follows:
proc means N mean std maxdec=2 data=sasuser.book4;
var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14;
run;
The output will be as follows:
To ensure that the means are interpreted correctly, the agreement scale should run
from (5) for strongly agree to (1) for strongly disagree. The means obtained above for
q1 to q14 are therefore not correct, because the codes were assigned in the opposite
order. To correct the codes, the variables should be recoded to change (5 to 1), (4 to 2),
(3 to 3), (2 to 4) and (1 to 5). Subtracting each value from 6 does this:
data sasuser.book4;
set sasuser.book4;
qq1 = 6- q1;
qq2 = 6- q2;
qq3 = 6- q3;
qq4 = 6- q4;
qq5 = 6- q5;
qq6 = 6- q6;
qq7 = 6- q7;
qq8 = 6- q8;
qq9 = 6- q9;
qq10 = 6- q10;
qq11 = 6- q11;
qq12 = 6- q12;
qq13 = 6- q13;
qq14 = 6- q14;
run;
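The fourteen assignment statements above all apply the same rule, new = 6 - old. A shorter way to express this is with arrays. The sketch below assumes that q1 to q14 exist in sasuser.book4 (that is, after the rename step) and writes to a hypothetical data set work.book4_rev so that the original is left untouched.

```sas
data work.book4_rev;              /* hypothetical output data set name */
   set sasuser.book4;
   array q{14}  q1-q14;           /* original items */
   array qq{14} qq1-qq14;         /* reverse-coded items */
   do i = 1 to 14;
      qq{i} = 6 - q{i};           /* 5->1, 4->2, 3->3, 2->4, 1->5 */
   end;
   drop i;
run;
```

The variables qq1-qq14 can then be summarized with proc means in the same way as q1-q14.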
proc freq data=sasuser.book4;
tables q1-q9;
run;
For a complete analysis, the results for questions q1 to q14 should be broken down
by the available demographic variables. The procedure is as follows:
proc sort data=sasuser.book4;
by p1;
run;
proc means data=sasuser.book4;
var q1;
by p1;
run;
The output for this breakdown will be:
The previous analysis can be done for all variables q1-q14 in one step as follow:
proc sort data=sasuser.book4;
by p1;
run;
proc means data=sasuser.book4;
var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14;
by p1;
run;
The output will be:
The same analysis will be conducted with p2 up to p3.
T-TEST
One sample t test
Three types of t-test can be applied to the running example. The first is the
one-sample t-test.
proc ttest h0=3 alpha=0.1 data=sasuser.book4;
var q1;
run;
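The h0=3 option sets the hypothesized mean to 3, the midpoint of the five-point scale, and alpha=0.1 requests 90% confidence limits. To see what proc ttest computes, the t statistic can be reproduced by hand. The following is a sketch under the assumption that q1 is numeric with at least two non-missing values; the data set names work.q1stats and work.q1_ttest and the variable names n_q1, mean_q1 and std_q1 are hypothetical.

```sas
/* Summary statistics for q1, saved to a data set instead of printed */
proc means data=sasuser.book4 noprint;
   var q1;
   output out=work.q1stats n=n_q1 mean=mean_q1 std=std_q1;
run;

data work.q1_ttest;
   set work.q1stats;
   df = n_q1 - 1;
   t  = (mean_q1 - 3) / (std_q1 / sqrt(n_q1));   /* test statistic for H0: mean = 3 */
   p  = 2 * (1 - probt(abs(t), df));             /* two-sided p-value */
run;

proc print data=work.q1_ttest;
   var n_q1 mean_q1 std_q1 df t p;
run;
```

The values of t, df and p should agree with the t Value, DF and Pr > |t| columns printed by proc ttest.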
The output will be:
The same test can be repeated for q2 to q14 to assess whether each variable differs
significantly from the hypothesized mean.
proc ttest h0=3 alpha=0.1 data=sasuser.book4;
var q1-q14;
run;
Paired t-test
The second type of t-test that can be applied to the current questionnaire is the paired
t-test.
proc ttest data=sasuser.book4;
paired q1*q4 q1*q2;
run;
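A paired t-test is equivalent to a one-sample t-test on the within-respondent differences with a hypothesized mean of zero. A brief sketch, using a hypothetical data set work.paired and a hypothetical difference variable d_q1_q4:

```sas
data work.paired;
   set sasuser.book4;
   d_q1_q4 = q1 - q4;      /* difference for each respondent */
run;

proc ttest data=work.paired h0=0;
   var d_q1_q4;
run;
```

The t value and p-value should match those reported for q1*q4 by the paired statement.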
CORRELATION TEST
To find out whether two variables are correlated with each other, a correlation test
is used. The correlation procedure is as follows:
proc corr data=sasuser.book4;
var q1 q2;
run;
The output will be:
The result shows a high correlation between the two variables, which is consistent
with the paired-sample t-test.
Independent sample t-test
The third type of t-test is the independent-sample t-test. This test involves two
variables: a categorical grouping variable with two levels and a continuous response
variable. The test can therefore be carried out for sex (p1) against q1 to q14. The
procedure is as follows:
proc ttest data=sasuser.book4;
class p1;
var q1-q14;
run;
ANOVA
To test the effect of educational level, position and experience on the attitudes
expressed in questions q1 to q14, analysis of variance is used. The procedure is as
follows:
proc anova data=sasuser.book4;
class p2;
model q1 = p2;
run;
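A significant overall F-test does not say which groups differ. A minimal sketch of a follow-up, assuming p3 holds the educational level (as in the variable labels above) and has more than two categories: the means statement with the tukey option requests pairwise comparisons of the group means.

```sas
proc anova data=sasuser.book4;
   class p3;
   model q1 = p3;
   means p3 / tukey;     /* pairwise comparisons of educational levels */
run;
quit;
```

proc anova is intended for balanced designs, although a one-way layout such as this one is handled correctly even with unequal group sizes; for other unbalanced designs proc glm is the usual choice.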
In practice, the tests are generally not carried out for each question separately. The
questionnaire in Appendix I shows that q1-q7 represent one field of the survey, while
q8-q14 represent another. The mean of each field should therefore be calculated in a
new variable, which is then tested against the demographic variables. This can be
done as follows:
data sasuser.book4;
set sasuser.book4;
q = (q1 + q2 + q3 + q4 + q5 + q6 + q7)/7;
run;
and for the q8 to q14:
data sasuser.book4;
set sasuser.book4;
qq = (q8 + q9 + q10 + q11 + q12 + q13 + q14)/7;
run;
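Adding the items and dividing by a fixed count returns a missing value whenever any single item is missing. A sketch of an alternative, assuming the same item groupings: the mean function averages whatever non-missing items are present. The output data set name work.book4_scores is hypothetical.

```sas
data work.book4_scores;
   set sasuser.book4;
   q  = mean(of q1-q7);     /* first field: items q1 to q7 */
   qq = mean(of q8-q14);    /* second field: items q8 to q14 */
run;
```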
The ANOVA can then be carried out with q and qq against the demographic
variables.
REGRESSION ANALYSIS
To measure the effect of q on qq, regression analysis can be used. The variable q is
the independent variable and qq is the dependent variable. The procedure is as
follows:
proc reg data=sasuser.book4;
model qq=q;
run;
ODS IN SAS
The Output Delivery System (ODS) provides a way to manage SAS output. SAS
output can be directed to files that other software can read, in Rich Text Format
(RTF), HTML, or other forms. The output can be sent to other software using
the following procedure:
ODS RTF;
proc means N mean std maxdec=2 data=sasuser.book4;
var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14;
run;
ODS RTF Close;
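Without a file= option, SAS chooses the output file name itself. A minimal sketch that names the files explicitly and sends the same output to both RTF and HTML; the file names means_output.rtf and means_output.html are hypothetical and are written to the current working directory.

```sas
ods rtf  file="means_output.rtf";
ods html file="means_output.html";

proc means n mean std maxdec=2 data=sasuser.book4;
   var q1-q14;
run;

ods html close;
ods rtf  close;
```

Closing the HTML destination also stops HTML output to the Results window in sessions where HTML is the default destination; it can be reopened with an ods html statement.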
The output is written to a Rich Text Format file that can be opened in Microsoft
Office. Inside SAS the output looks as follows:
In Microsoft Office it looks like:
APPENDICES
Questionnaire
In the name of God, the Most Gracious, the Most Merciful
Dear Sir/Madam,
Greetings,
This questionnaire aims to measure the effect of training on improving employee performance. The study aims to offer proposals for improving the training programs adopted by your esteemed institution, in order to raise its performance in a way that serves the achievement of your objectives.
Your care in providing the requested data and information accurately and objectively will undoubtedly help in reaching better and more precise results, and therefore in offering more useful recommendations.
We therefore kindly ask you to mark the items of the attached form in a way that reflects the training programs applied at your center.
Please note that the data and information you provide for this study will be used only for scientific research purposes and will be treated with complete confidentiality. You will be provided with the results of the study once it is completed, should you wish to see them.
Thank you for your kind cooperation.

Part One: Please place an (X) in the appropriate place.
Personal and job characteristics
- Sex: ( ) Male ( ) Female
- Age: ( ) Less than 25 years ( ) 25 to less than 35 ( ) 35 to less than 45 ( ) 45 to less than 55 ( ) 55 and over
- Educational level: ( ) Intermediate diploma or less ( ) Bachelor's degree ( ) Higher diploma ( ) Master's degree ( ) Doctorate
- Administrative level: ( ) Top management ( ) Middle management ( ) Supervisory management
- Years of experience: ( ) Less than 5 years ( ) 5 to less than 10 years ( ) 10 to less than 15 years ( ) 15 years and over

Part Two: Please place an (x) in the place you consider appropriate.
Independent variables (training)
Response scale for each item: Strongly agree / Agree / Not sure / Disagree / Strongly disagree
First: Identifying training needs
1- Identifying training needs effectively helps the training process succeed
2- The objectives of the training program are stated clearly and precisely
3- Training programs are selected according to the needs of the employees and the work
4- Training programs are designed using a scientific methodology
5- Management is keen to identify employees' training needs in order to improve their level of performance
6- Management is keen to identify areas of weak performance among employees
7- Management looks for the causes of errors in performance and works to eliminate them
Second: Efficiency of the training programs
1- The training programs delivered are among the best means of improving work performance
2- Training has helped reduce job-related burdens within the department
3- The use of advanced scientific methods has helped increase the efficiency of the training programs
4- The training programs have helped employees acquire skills and knowledge that were applied in the institution
5- The training programs have helped employees develop a competitive spirit
6- The training programs give employees the opportunity for hands-on practice
7- The company contributes by preparing training programs to create distinguished staff
SAS t-test Commands
This handout illustrates how to read in raw data to SAS, set up missing values and create
new variables using transformations and recodes. We illustrate independent samples t-tests,
paired t-tests, and one-sample t-tests.
Read in Raw Data
In the first data step, we read in the raw data using an infile and input statement. We don't
need to tell SAS the column location of each variable, because there is at least one blank
between variables, so we can use a free-format input statement where the variables are
simply listed in the order they appear in the raw data file.
/*Read in the raw data*/
data owen;
infile "owen.dat" ;
input family child age sex race w_rank income_c height weight hemo
vit_c vit_a head_cir fatfold b_weight mot_age b_order
m_height
f_height ;
run;
Create a Permanent Dataset
After reading in the raw data, we create a new permanent SAS dataset in which we set up
missing values and create new variables using recodes and transformations. Note in setting
up the missing value codes, a dot (.) is used for the missing value code and no quotes are
employed, because all of these variables are numeric. Although we used two data steps in
this example, all of this code could have been accomplished in a single data step.
libname b510 "c:\documents and settings\kwelch\desktop\b510";
data b510.owen;
set owen;
if height = 999 then height = .;
if weight = 999 then weight = .;
if vit_a = 99 then vit_a = .;
if head_cir = 99 then head_cir = .;
if fatfold = 99 then fatfold = .;
if b_weight = 999 then b_weight= .;
if mot_age = 99 then mot_age = .;
if b_order = 99 then b_order = .;
if m_height = 999 then m_height=.;
if f_height = 999 then f_height=.;
bwt_g = b_weight*10;
if bwt_g not=. and bwt_g < 2500 then lowbwt=1;
if bwt_g >=2500 then lowbwt=0;
log_fatfold = log(fatfold);
htdiff = f_height - m_height;
bmi = weight /(height/100)**2;
run;
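As noted above, the two data steps can be combined into one. The following sketch shows one way to do so, using arrays so that each missing-value code is tested in a single loop. The data set name b510.owen_onestep is hypothetical, chosen so that b510.owen is not overwritten; the libname b510 and the file owen.dat are those used above.

```sas
data b510.owen_onestep;
   infile "owen.dat";
   input family child age sex race w_rank income_c height weight hemo
         vit_c vit_a head_cir fatfold b_weight mot_age b_order
         m_height f_height;

   /* variables that use 999 and 99 as missing-value codes */
   array code999{*} height weight b_weight m_height f_height;
   array code99{*}  vit_a head_cir fatfold mot_age b_order;

   do i = 1 to dim(code999);
      if code999{i} = 999 then code999{i} = .;
   end;
   do i = 1 to dim(code99);
      if code99{i} = 99 then code99{i} = .;
   end;
   drop i;

   /* derived variables, as in the two-step version */
   bwt_g = b_weight*10;
   if bwt_g not=. and bwt_g < 2500 then lowbwt=1;
   if bwt_g >= 2500 then lowbwt=0;
   log_fatfold = log(fatfold);
   htdiff = f_height - m_height;
   bmi = weight /(height/100)**2;
run;
```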
Basic Descriptive Statistics
It is always good practice to check a dataset after you have created it. Proc Means is useful
for numeric variables. Be especially attentive to the number of observations (N) and the
minimum and maximum value for each variable. Check to see that they are reasonable.
/*Simple Descriptive Statistics on all Numeric Variables*/
proc means data=b510.owen;
run;
The MEANS Procedure
Variable N Mean Std Dev Minimum Maximum
-----------------------------------------------------------------------------------
family 1006 4525.11 1634.03 2000.00 7569.00
child 1006 1.3359841 0.5716672 1.0000000 3.0000000
age 1006 44.0248509 16.6610452 12.0000000 73.0000000
sex 1006 1.4890656 0.5001291 1.0000000 2.0000000
race 1006 1.2823062 0.4503454 1.0000000 2.0000000
w_rank 1006 2.2127237 0.9024440 1.0000000 4.0000000
income_c 1006 1581.31 974.2279710 80.0000000 6250.00
height 1001 99.0429570 11.4300111 70.0000000 130.0000000
weight 1000 15.6290800 3.6523446 8.2400000 41.0800000
hemo 1006 12.4606362 1.1578850 6.2000000 24.1000000
vit_c 1006 1.1302187 0.6599121 0.1000000 3.5000000
vit_a 763 36.0380079 8.8951237 15.0000000 78.0000000
head_cir 999 49.3763764 2.0739057 39.0000000 56.0000000
fatfold 993 4.4562941 1.6683194 2.6000000 42.0000000
b_weight 986 325.0517241 59.5162936 91.0000000 544.0000000
mot_age 981 29.2660550 6.2603025 17.0000000 51.0000000
b_order 980 2.9479592 2.1939526 1.0000000 16.0000000
m_height 980 163.7632653 6.3663343 122.0000000 199.0000000
f_height 975 178.2194872 7.3821354 152.0000000 210.0000000
bwt_g 986 3250.52 595.1629357 910.0000000 5440.00
lowbwt 986 0.1075051 0.3099115 0 1.0000000
log_fatfold 993 1.4599658 0.2396859 0.9555114 3.7376696
htdiff 972 14.4218107 8.7834139 -12.0000000 56.0000000
bmi 998 15.8124399 1.6634700 11.0247934 26.2912000
-----------------------------------------------------------------------------------
Descriptives for Subgroups using a Class Statement
A Class statement can be used with Proc Means to get descriptive statistics for subgroups of
cases. You don't have to sort the data when using a class statement.
proc means data=b510.owen;
class sex;
var bwt_g bmi fatfold log_fatfold;
run;
The MEANS Procedure
SEX N Obs Variable Label N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------------
1 514 bwt_g 497 3340.56 565.3268435 1360.00 5170.00
bmi 510 15.8982386 1.6074313 11.3795135 26.2912000
FATFOLD FATFOLD 507 4.2518738 0.9720458 2.6000000 10.2000000
log_fatfold 507 1.4247028 0.2076417 0.9555114 2.3223877
2 492 bwt_g 489 3159.00 611.1350784 910.0000000 5440.00
bmi 488 15.7227732 1.7171565 11.0247934 24.4485835
FATFOLD FATFOLD 486 4.6695473 2.1489049 2.6000000 42.0000000
log_fatfold 486 1.4967524 0.2643232 0.9555114 3.7376696
--------------------------------------------------------------------------------------
Descriptives for Subgroups using a By Statement
A By statement is another way to get information for subgroups of cases. You need to sort
the data first when using a By statement. The By statement is more generally applicable than
the Class statement and can be used with most SAS procedures (e.g. Proc Reg, Proc Freq). To
avoid too much output, use a By statement only for variables that have a limited number of
levels.
proc sort data=b510.owen;
by sex;
run;
proc means data=b510.owen;
by sex;
var bwt_g bmi fatfold log_fatfold;
run;
-------------------------------------------- SEX=1 -----------------------------------
The MEANS Procedure
Variable Label N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------------
bwt_g 497 3340.56 565.3268435 1360.00 5170.00
bmi 510 15.8982386 1.6074313 11.3795135 26.2912000
FATFOLD FATFOLD 507 4.2518738 0.9720458 2.6000000 10.2000000
log_fatfold 507 1.4247028 0.2076417 0.9555114 2.3223877
--------------------------------------------------------------------------------------
-------------------------------------------- SEX=2 -----------------------------------
Variable Label N Mean Std Dev Minimum Maximum
--------------------------------------------------------------------------------------
bwt_g 489 3159.00 611.1350784 910.0000000 5440.00
bmi 488 15.7227732 1.7171565 11.0247934 24.4485835
FATFOLD FATFOLD 486 4.6695473 2.1489049 2.6000000 42.0000000
log_fatfold 486 1.4967524 0.2643232 0.9555114 3.7376696
--------------------------------------------------------------------------------------
Boxplots
Boxplots are a nice way to visualize data when you wish to compare the value of a
continuous variable for two or more groups. In SAS 9.2, you can use Proc Sgplot to get
boxplots. Proc Boxplot can be used in earlier versions of SAS, and in SAS 9.2.
/*Boxplots*/
proc sgplot data=b510.owen;
vbox bwt_g / category=sex;
run;
proc sgplot data=b510.owen;
vbox bmi / category=sex;
run;
proc sgplot data=b510.owen;
vbox fatfold / category=sex;
run;
proc sgplot data=b510.owen;
vbox log_fatfold / category=sex;
run;
The boxplots show the median, upper and lower quartiles, give an idea of skewness, and
indicate outliers.
Independent Samples t-test
An independent samples t-test can be used to compare the mean of a continuous
variable (e.g., birthweight), for two groups of cases. In this example, we are
comparing the means of BWT_G, WEIGHT, and LOG_FATFOLD for females vs.
males. Notice that Proc ttest uses a class statement for an independent samples t-
test—no sorting of the data is necessary.
The assumptions for the t-test are that the observations are independent (i.e., the
values of individuals are not correlated), that the underlying distribution of the
continuous variable is normal within the two groups, and that the variances in the two
groups are equal. The t-test is robust to departures from the normality assumption, if
the sample size is large (e.g. 50 or more cases). The equality of variances is a more
important assumption. SAS gives a test of equality of variances at the bottom of the t-
test output. If equality of variances is a reasonable assumption, the F-test for equality
of variances will not be significant. We often use a somewhat higher alpha level than
usual for this equality of variances test (e.g., p>.10) to be more conservative (i.e., we
don't want to wrongly assume equal variances, when in fact they are unequal). SAS
produces two different t-test results, the first one assumes equality of variances and
the second one does not. You can choose the test to use based on the results of the
equality of variances test. By default, SAS always reports a two-sided p-value for the
t-test.
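The rule just described (use the pooled result when the equality-of-variances test is not significant at the 0.10 level, and the Satterthwaite result otherwise) can also be applied programmatically by capturing the proc ttest tables with ods output. This is a sketch only: the ODS table names TTests and Equality and the column names Variable, Variances and ProbF are assumed to match those produced by the installed SAS release and should be checked with proc contents; work.tt, work.eq and work.chosen are hypothetical data set names.

```sas
/* Capture the t-test and equality-of-variances tables as data sets */
ods output TTests=work.tt Equality=work.eq;
proc ttest data=b510.owen;
   class sex;
   var bwt_g weight log_fatfold;
run;

proc sort data=work.tt; by Variable; run;
proc sort data=work.eq; by Variable; run;

/* Keep one t-test row per variable, chosen by the folded F test */
data work.chosen;
   merge work.tt work.eq(keep=Variable ProbF);
   by Variable;
   if (ProbF >  0.10 and Variances = "Equal") or
      (ProbF <= 0.10 and Variances = "Unequal");
run;

proc print data=work.chosen;
run;
```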
proc ttest data=b510.owen;
class sex;
var bwt_g weight log_fatfold;
run;
Variable: bwt_g

SEX          N     Mean     Std Dev   Std Err   Minimum   Maximum
1            497   3340.6   565.3     25.3584   1360.0    5170.0
2            489   3159.0   611.1     27.6365   910.0     5440.0
Diff (1-2)         181.6    588.5     37.4840

SEX          Method          Mean     95% CL Mean       Std Dev   95% CL Std Dev
1                            3340.6   3290.7   3390.4   565.3     532.2   602.8
2                            3159.0   3104.7   3213.3   611.1     575.1   652.0
Diff (1-2)   Pooled          181.6    108.0    255.1    588.5     563.6   615.7
Diff (1-2)   Satterthwaite   181.6    108.0    255.2

Method   Variances   DF    t Value   Pr > |t|
Pooled   Equal       984   4.84
Variable: bmi

SEX          N     Mean      Std Dev   Std Err   Minimum   Maximum
1            510   15.8982   1.6074    0.0712    11.3795   26.2912
2            488   15.7228   1.7172    0.0777    11.0248   24.4486
Diff (1-2)         0.1755    1.6620    0.1052

SEX          Method          Mean      95% CL Mean         Std Dev   95% CL Std Dev
1                            15.8982   15.7584   16.0381   1.6074    1.5145   1.7126
2                            15.7228   15.5700   15.8755   1.7172    1.6158   1.8322
Diff (1-2)   Pooled          0.1755    -0.0311   0.3820    1.6620    1.5921   1.7383
Diff (1-2)   Satterthwaite   0.1755    -0.0314   0.3823

Method          Variances   DF      t Value   Pr > |t|
Pooled          Equal       996     1.67      0.0958
Satterthwaite   Unequal     984.1   1.66      0.0963

Equality of Variances
Method     Num DF   Den DF   F Value   Pr > F
Folded F   487      509      1.14      0.1407
Variable: log_fatfold

SEX          N     Mean      Std Dev   Std Err   Minimum   Maximum
1            507   1.4247    0.2076    0.00922   0.9555    2.3224
2            486   1.4968    0.2643    0.0120    0.9555    3.7377
Diff (1-2)         -0.0720   0.2371    0.0151

SEX          Method          Mean      95% CL Mean         Std Dev   95% CL Std Dev
1                            1.4247    1.4066    1.4428    0.2076    0.1956   0.2213
2                            1.4968    1.4732    1.5203    0.2643    0.2487   0.2821
Diff (1-2)   Pooled          -0.0720   -0.1016   -0.0425   0.2371    0.2271   0.2480
Diff (1-2)   Satterthwaite   -0.0720   -0.1017   -0.0424

Method   Variances   DF    t Value   Pr > |t|
Pooled   Equal       991   -4.79
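The rule described earlier (use the Satterthwaite result when the equality of variances test is significant at the .10 level, otherwise the pooled result) can also be applied programmatically. The following is a minimal sketch, not part of the original handout. It assumes the Proc ttest ODS tables are named TTests and Equality and contain the columns Variable, Method, DF, tValue, Probt, and ProbF; these are worth confirming with Proc Contents on your own installation. The dataset names work.tt, work.eq, and work.chosen are illustrative.

```sas
/* Capture the t-test and equality-of-variances tables as datasets.
   TTests and Equality are assumed ODS table names for Proc ttest. */
ods output TTests=work.tt Equality=work.eq;
proc ttest data=b510.owen;
   class sex;
   var bwt_g weight log_fatfold;
run;

/* For each variable keep one row: Satterthwaite if the folded F test
   has p < .10, otherwise Pooled. Column names are assumptions. */
proc sql;
   create table work.chosen as
   select t.Variable, t.Method, t.DF, t.tValue, t.Probt, e.ProbF
   from work.tt as t
        inner join work.eq as e
        on t.Variable = e.Variable
   where (e.ProbF <  0.10 and t.Method = "Satterthwaite")
      or (e.ProbF >= 0.10 and t.Method = "Pooled");
quit;

proc print data=work.chosen noobs;
run;
```

The .10 cutoff follows the handout's suggestion of a higher alpha level for the variance test; change it in both conditions if you prefer a different threshold.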
N     Mean      Std Dev   Std Err   Minimum    Maximum
972   14.4218   8.7834    0.2817    -12.0000   56.0000

Mean      95% CL Mean         Std Dev   95% CL Std Dev
14.4218   13.8689   14.9747   8.7834    8.4096   9.1923

DF    t Value   Pr > |t|
971   51.19

Mean      95% CL Mean         Std Dev   95% CL Std Dev
14.4352   13.6374   15.2331   9.0257    8.4958   9.6266

DF    t Value   Pr > |t|
493   35.55

DF    t Value   Pr > |t|
477   36.91
proc ttest data=b510.owen;
var htdiff;
run;
The TTEST Procedure
Variable: htdiff

N     Mean      Std Dev   Std Err   Minimum    Maximum
972   14.4218   8.7834    0.2817    -12.0000   56.0000

Mean      95% CL Mean         Std Dev   95% CL Std Dev
14.4218   13.8689   14.9747   8.7834    8.4096   9.1923

DF    t Value   Pr > |t|
971   51.19

DF    t Value   Pr > |t|
971   -2.05     0.0404
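The final output line above (t = -2.05, p = 0.0404) agrees with the Proc Univariate test of Mu0=15 reported later in this handout, so it presumably comes from a one-sample test against a null mean of 15; the code that produced it did not survive in this copy. A minimal sketch of how such a test could be requested, assuming the H0= option of Proc ttest:

```sas
/* One-sample t-test of H0: mean(htdiff) = 15.
   The null value 15 is taken from the Mu0=15 example in this handout;
   when H0= is omitted, Proc ttest tests against 0. */
proc ttest data=b510.owen h0=15;
   var htdiff;
run;
```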
One-sample t-test using Proc Univariate
Proc Univariate can also be used to carry out a one-sample t-test, to get more
information about the distribution of a variable, and to look at a histogram of the
distribution of the variable.
proc univariate data=b510.owen;
var htdiff;
histogram / normal;
run;
The UNIVARIATE Procedure
Variable: htdiff

Moments
N                 972          Sum Weights        972
Mean              14.4218107   Sum Observations   14018
Std Deviation     8.78341392   Variance           77.1483601
Skewness          0.31703251   Kurtosis           0.56094005
Uncorrected SS    277076       Corrected SS       74911.0576
Coeff Variation   60.9036833   Std Error Mean     0.28172813

Basic Statistical Measures
Location            Variability
Mean     14.42181   Std Deviation         8.78341
Median   15.00000   Variance              77.14836
Mode     15.00000   Range                 68.00000
                    Interquartile Range   12.00000
Tests for Location: Mu0=0
Test          -Statistic-      -----p Value------
Student's t   t   51.19052     Pr > |t|
Sign          M                Pr >= |M|
Signed Rank   S                Pr >= |S|

Missing Values
Missing            -----Percent Of-----
Value     Count    All Obs   Missing Obs
.         34       3.38      100.00

Fitted Normal Distribution for htdiff
Parameters for Normal Distribution
Parameter   Symbol   Estimate
Mean        Mu       14.42181
Std Dev     Sigma    8.783414

Goodness-of-Fit Tests for Normal Distribution
Test                 ----Statistic-----   ------p Value------
Kolmogorov-Smirnov   D      0.07149425    Pr > D
Cramer-von Mises     W-Sq                 Pr > W-Sq
Anderson-Darling     A-Sq                 Pr > A-Sq

Quantiles for Normal Distribution
Percent   Observed   Estimated
10.0      3.0000     3.16541
25.0      8.0000     8.49749
50.0      15.0000    14.42181
75.0      20.0000    20.34613
90.0      25.0000    25.67821
95.0      29.0000    28.86924
99.0      37.0000    34.85509
One-sample t-test using Proc Univariate with a specified null hypothesis value for the mean
We can also specify a null hypothesis value for the mean when using Proc Univariate by
using the mu0 option.
proc univariate data=b510.owen mu0=15;
var htdiff;
run;
Tests for Location: Mu0=15
Test -Statistic- -----p Value------
Student's t t -2.0523 Pr > |t| 0.0404
Sign M -40 Pr >= |M| 0.0071
Signed Rank S -18300 Pr >= |S| 0.0121
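The Student's t statistics in the two Tests for Location tables can be reproduced by hand from the Moments table. A short worked check, using the standard one-sample formula (the symbols \bar{x}, s, n, and \mu_0 are notation introduced here, not taken from the handout):

```latex
\begin{aligned}
t &= \frac{\bar{x} - \mu_0}{s / \sqrt{n}},
  \qquad \frac{s}{\sqrt{n}} = \frac{8.78341392}{\sqrt{972}} \approx 0.28173 \\[4pt]
\mu_0 = 0:\quad  t &= \frac{14.4218107 - 0}{0.28172813} \approx 51.19 \\[4pt]
\mu_0 = 15:\quad t &= \frac{14.4218107 - 15}{0.28172813} \approx -2.05
\end{aligned}
```

Both statistics have n - 1 = 971 degrees of freedom. The mean is far from 0 but only slightly below 15, which is why the first test is overwhelmingly significant while the second has p = 0.0404.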
SAS Simple Linear Regression Example
This handout gives examples of how to use SAS to generate a simple linear regression plot, check the correlation
between two variables, fit a simple linear regression model, check the residuals from the model, and also shows
some of the ODS (Output Delivery System) output in SAS.
Read in Raw Data
We first read in the raw data from the werner2.dat raw dataset, and set up the missing value codes using a data
step, and then check descriptive statistics for the numeric variables, using Proc Means.
OPTIONS FORMCHAR="|----|+|---+=|-/\*";
libname b510 "C:\Users\kwelch\Desktop\B510";
DATA b510.werner;
INFILE "C:\Users\kwelch\Desktop\B510\werner2.dat";
INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16
PILL 17-20 CHOL 21-24 ALB 25-28 1
CALC 29-32 1 URIC 33-36 1;
IF HT = 999 THEN HT = .;
IF WT = 999 THEN WT = .;
IF CHOL = 600 THEN CHOL = .;
IF ALB = 99 THEN ALB = .;
IF CALC = 99 THEN CALC = .;
IF URIC = 99 THEN URIC = .;
run;
/*Check the Data*/
title "DESCRIPTIVE STATISTICS";
proc means data=b510.werner;
run;
DESCRIPTIVE STATISTICS
The MEANS Procedure
Variable   N     Mean          Std Dev      Minimum      Maximum
-------------------------------------------------------------------------------
ID         188   1598.96       1057.09      3.0000000    3519.00
AGE        188   33.8191489    10.1126942   19.0000000   55.0000000
HT         186   64.5107527    2.4850673    57.0000000   71.0000000
WT         186   131.6720430   20.6605767   94.0000000   215.0000000
PILL       188   1.5000000     0.5013351    1.0000000    2.0000000
CHOL       187   235.1550802   44.5706219   50.0000000   390.0000000
ALB        186   4.1112903     0.3579694    3.2000000    5.0000000
CALC       185   9.9621622     0.4795556    8.6000000    11.1000000
URIC       187   4.7705882     1.1572312    2.2000000    9.9000000
-------------------------------------------------------------------------------
Correlation
We now check the correlation between the response (or dependent) variable, CHOL, and the predictor (or
independent) variable, AGE. It is positive and significant (r = .369, p < .0001).
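The Proc Corr step that produced the correlation output is missing from this copy of the handout. A minimal sketch of what such a step might look like, assuming the default Pearson correlation and the b510.werner dataset created above (the title text is an illustrative assumption):

```sas
/* Pearson correlation between AGE and CHOL.
   By default Proc Corr uses all available pairs, so the single
   missing CHOL value reduces the pairwise N to 187. */
title "Correlation between AGE and CHOL";
proc corr data=b510.werner;
   var age chol;
run;
```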
Variable   N     Mean        Std Dev    Sum     Minimum    Maximum
AGE        188   33.81915    10.11269   6358    19.00000   55.00000
CHOL       187   235.15508   44.57062   43974   50.00000   390.00000

Pearson Correlation Coefficients
Prob > |r| under H0: Rho=0
Number of Observations
       AGE       CHOL
AGE    1.00000   0.36923
Simple Linear Regression
We now fit a linear regression model, with CHOL as the Y (dependent or outcome) variable and AGE as the X
(independent or predictor) variable, using Proc Reg. We first illustrate the most basic Proc Reg syntax, and then
show some useful options. The Quit statement is used to tell SAS that there are no more statements coming for
this run of Proc Reg.
The output shows that there is a positive relationship between these two variables. When age increases by one
year, average cholesterol is predicted to increase by about 1.63 units, and this is a significant relationship
(t(185) = 5.40, p < .0001).
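The basic Proc Reg call described above is missing from this copy of the handout. A minimal sketch of the syntax, assuming the b510.werner dataset and reusing the title that appears on the output:

```sas
/* Simple linear regression of CHOL on AGE with no options.
   Proc Reg is interactive, so Quit ends the procedure. */
title "Simple Linear Regression Model with no options";
proc reg data=b510.werner;
   model chol = age;
run;
quit;
```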
Simple Linear Regression Model with no options
The REG Procedure
Model: MODEL1
Dependent Variable: CHOL

Number of Observations Read                  188
Number of Observations Used                  187
Number of Observations with Missing Values   1

Analysis of Variance
                Sum of    Mean
Source   DF    Squares   Square   F Value   Pr > F
Model    1     50373     50373    29.20

Parameter Estimates
                 Parameter   Standard
Variable    DF   Estimate    Error      t Value   Pr > |t|
Intercept   1    179.96174   10.65564   16.89
We now include some diagnostic plots using Proc Reg. We also generate a new dataset called OUTREG1 that
contains all of the original variables, plus the predicted value for each observation (PREDICT), the residual (RESID),
the studentized deleted residual (RSTUD), Cook's distance (COOKD), and the 95% limits for an individual
prediction (LCL, UCL) and for the mean (LCLM, UCLM).
ods graphics on;
title "Simple Linear Regression with Diagnostic Plots";
proc reg DATA=B510.werner;
MODEL CHOL=AGE / stb clb;
/* Name the studentized deleted residual RSTUD and keep the limits, to match the later steps */
OUTPUT OUT=OUTREG1 P=PREDICT R=RESID RSTUDENT=RSTUD COOKD=COOKD
       LCL=LCL UCL=UCL LCLM=LCLM UCLM=UCLM;
run;quit;
ods graphics off;
The partial output below shows the standardized estimate (obtained with the STB option), which shows the
estimated change in Y (in standard deviation units) when X is increased by one standard deviation. This estimate
is 0.369. We also see the 95% confidence limits for the parameter estimate (obtained with the CLB option), which are from 1.03 to 2.22.
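In a simple linear regression, the standardized estimate can be written in terms of the slope and the two sample standard deviations, and it equals the Pearson correlation between X and Y. A short sketch in notation introduced here (b is the AGE slope, and s_x and s_y are the standard deviations of AGE and CHOL over the observations used in the model):

```latex
b^{*} = b \cdot \frac{s_x}{s_y} = r_{xy}
```

This is why the standardized estimate of 0.369 agrees with the correlation of 0.36923 between AGE and CHOL reported in the Correlation section.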
Parameter Estimates
                 Parameter   Standard                        Standardized
Variable    DF   Estimate    Error      t Value   Pr > |t|   Estimate
Intercept   1    179.96174   10.65564   16.89
The diagnostic panel shows a series of diagnostic plots for this regression model.
The residual plot below shows a scatterplot with the residuals on the Y-axis and AGE on the X-axis. We want to
look for a lack of pattern in these residuals. We can see that there is one low outlier, at about age 25.
The fit plot shown below shows the regression model fit, and summarizes some of the statistics for the model.
Check the output dataset
We now check the output dataset, using Proc Print. We also request that Proc Print display the labels for
each variable, by using the Label option. We print selected variables for those observations whose
studentized deleted residuals are greater than or equal to 3 in absolute value, using a Where statement.
title "Partial Listing of Output Dataset";
proc print data=outreg1;
where abs(rstud) >=3;
VAR ID AGE CHOL PREDICT RESID RSTUD COOKD LCL UCL LCLM UCLM;
run;
Partial Listing of Output Dataset
Obs   ID     AGE   CHOL   PREDICT   RESID      RSTUD      COOKD      LCL       UCL       LCLM      UCLM
4     1797   25    50     220.686   -170.686   -4.32214   0.081802   138.358   303.014   212.698   228.674
182   3134   50    390    261.410   128.590    3.20326    0.094792   178.695   344.126   250.106   272.714
Check the residuals for normality
We now check the studentized residuals for normality, using Proc Univariate. This is similar to the output from
the ODS graphics that was shown in the earlier panel.
title "Checking Residuals for Normality";
proc univariate data=outreg1 PLOT NORMAL;
var rstud;
histogram / normal;
qqplot / normal(mu=est sigma=est);
run;
The residuals appear to be fairly normally distributed, but there is at least one very low outlier, which we
identified earlier, when we checked the values in the output dataset.
Refit the regression model without the cases in question
We now refit the model without the two outliers identified above (ID 1797 and ID 3134), by using a Where statement.
ods graphics on;
title "Rerun the model without two obs";
proc reg data=b510.WERNER;
where id not in (1797, 3134);
model chol=age;
run;quit;
ods graphics off;
We can see the changes in the parameter estimates from the output below.
Figure: Checking Residuals for Normality. Histogram of Studentized Residual without Current Obs (x-axis), with Percent on the y-axis.
Figure: Checking Residuals for Normality. Normal quantile plot of Studentized Residual without Current Obs (y-axis) against Normal Quantiles (x-axis).
Dependent Variable: CHOL

Number of Observations Read                  186
Number of Observations Used                  185
Number of Observations with Missing Values   1

Analysis of Variance
                Sum of    Mean
Source   DF    Squares   Square   F Value   Pr > F
Model    1     38478     38478    25.82

Parameter Estimates
                 Parameter   Standard
Variable    DF   Estimate    Error      t Value   Pr > |t|
Intercept   1    186.70039   9.98091    18.71
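To compare the parameter estimates from the two fits side by side rather than reading them from separate listings, the estimates can be captured as datasets. The following is a sketch, not part of the original handout. It assumes the Proc Reg ODS table is named ParameterEstimates and has the columns Variable, Estimate, StdErr, tValue, and Probt; the dataset names work.pe_all, work.pe_trim, and work.pe_compare and the variable fit are illustrative.

```sas
/* Fit 1: all observations; capture the parameter estimates */
ods output ParameterEstimates=work.pe_all;
proc reg data=b510.werner;
   model chol = age;
run;
quit;

/* Fit 2: the two flagged observations excluded */
ods output ParameterEstimates=work.pe_trim;
proc reg data=b510.werner;
   where id not in (1797, 3134);
   model chol = age;
run;
quit;

/* Stack the two tables and label which fit each row came from */
data work.pe_compare;
   length fit $ 20;
   set work.pe_all(in=from_all) work.pe_trim;
   if from_all then fit = "All observations";
   else fit = "Two obs removed";
run;

proc print data=work.pe_compare noobs;
   var fit Variable Estimate StdErr tValue Probt;
run;
```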