SAS PROGRAMMING: ANALYTICAL

Eng. Mohammad KHALAF

Mobile: 00962-79-5880413

Email: [email protected]

Webpage: www.statanalysis.weebly.com


TABLE OF CONTENTS

Graphics
Univariate
Correlation
Kernel Density Estimate
T-test
Analysis of each variable separately
T-test
    One sample t test
    Paired t-test
Correlation Test
    Independent sample t-test
ANOVA
Regression analysis
ODS in SAS
Appendices
    Questionnaire
    SAS t-test Commands
    SAS Simple Linear Regression Example


GRAPHICS

To produce a simple scatterplot of two variables, use proc gplot as follows:

data graph;

input x y;

datalines;

20 10

15 23

5 14

;

run;

proc print data=graph;

run;

proc gplot;

plot y * x;

run;

The data listing appears in the Output window, and the graph is displayed in the Graph window.

To join the points with a line, add the following symbol statement before proc gplot:

symbol1 i=join;

proc gplot;

plot y * x;

run;


where i= sets the interpolation method; i=join connects the points with straight lines.
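The same joined plot can also be drawn with Proc Sgplot, which this document uses later for boxplots. The following is a minimal sketch, assuming SAS 9.2 or later and the data set graph created above; it is an alternative to the symbol statement, not a replacement the original text prescribes.

```sas
/* Assumes the data set graph (variables x and y) created above */
proc sgplot data=graph;
  /* SERIES joins the points in data order; MARKERS also draws a symbol at each point */
  series x=x y=y / markers;
run;
```

Because the points are joined in data order, sorting the data by x first gives a line that runs from left to right.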

More additions to graph:

data graph;

input x y;

datalines;

20 10

15 23

5 14

;

run;

proc print data=graph;

run;

symbol1 v=none i=join;

symbol1 v=square i=join;

symbol2 v=circle i=join;

proc gplot;

plot y * x;

run;

where v= (value) sets the plot symbol drawn at each data point; v=none suppresses the symbol.

data graph;

input x y sex $;

datalines;

20 10 M

15 23 F


5 14 M

;

run;

symbol1 v=none i=join c=red;


proc gplot;

plot y * x = sex;

run;

The plot can be repeated with a different plot symbol:

symbol1 v=diamond i=join c=red;


proc gplot;

plot y * x = sex;

run;


UNIVARIATE

data water;

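length flag $ 1 Town $ 16;

input Town $ Mortal Hardness;

/* Towns marked with a leading asterisk get flag='*'; the asterisk is then removed from Town */

if Town =: '*' then do; flag = '*'; Town = substr(Town, 2); end;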

datalines;

Bath 1247 105

*Birkenhead 1668 17

Birmingham 1466 5

*Blackburn 1800 14

*Blackpool 1609 18

*Bolton 1558 10

*Bootle 1807 15

Bournemouth 1299 78

*Bradford 1637 10

Brighton 1359 84

Bristol 1392 73

*Burnley 1755 12

Cardiff 1519 21

Coventry 1307 78

Croydon 1254 96

*Darlington 1491 20

*Derby 1555 39

*Doncaster 1428 39

EastHam 1318 122

Exeter 1260 21

*Gateshead 1723 44

*Grimsby 1379 94

*Halifax 1742 8

*Huddersfield 1574 9

*Hull 1569 91

Ipswich 1096 138

*Leeds 1591 16

Leicester 1402 37

*Liverpool 1772 15

*Manchester 1828 8

*Middlesbrough 1704 26

*Newcastle 1702 44

Newport 1581 14

Northampton 1309 59

Norwich 1259 133

*Nottingham 1427 27

*Oldham 1724 6

Oxford 1175 107

Plymouth 1486 5

Portsmouth 1456 90

*Preston 1696 6

Reading 1236 101

*Rochdale 1711 13

*Rotherham 1444 14

*StHelens 1591 49

*Salford 1987 8

*Sheffield 1495 14

Southampton 1369 68

Southend 1257 50

*Southport 1587 75

*SouthShields 1713 71

*Stockport 1557 13

*Stoke 1640 57

*Sunderland 1709 71

Swansea 1625 13


*Wallasey 1625 20

Walsall 1527 60

WestBromwich 1627 53

WestHam 1486 122

Wolverhampton 1485 81

*York 1378 71

;

run;

proc print data=water;

run;

proc univariate data=water normal;

var mortal hardness;

histogram mortal hardness /normal;

probplot mortal hardness;

run;

The meanings of some of the other statistics printed in these displays are as follows:

Abbreviation             Meaning
Uncorrected SS           Uncorrected sum of squares; simply the sum of squares of the observations
Corrected SS             Corrected sum of squares; simply the sum of squares of deviations of the observations from the sample mean
Coeff Variation          Coefficient of variation; the standard deviation divided by the mean and multiplied by 100
Std Error Mean           Standard deviation divided by the square root of the number of observations
Range                    Difference between largest and smallest observation in the sample
Interquartile Range      Difference between the 25% and 75% quantiles (see values of quantiles given later in display to confirm)
Student's t              Student's t-test value for testing that the population mean is zero
Pr>|t|                   Probability of a greater absolute value for Student's t-value under the hypothesis that the population mean is zero
Sign Test                Nonparametric test statistic for testing whether the population median is zero
Pr>|M|                   Approximation to the probability of a greater absolute value for the Sign test under the hypothesis that the population median is zero
Signed Rank              Nonparametric test statistic for testing whether the population mean is zero
Pr>=|S|                  Approximation to the probability of a greater absolute value for the Signed Rank statistic under the hypothesis that the population mean is zero
Shapiro-Wilk W           Shapiro-Wilk statistic for assessing the normality of the data and the corresponding P-value (Shapiro and Wilk [1965])
Kolmogorov-Smirnov D     Kolmogorov-Smirnov statistic for assessing the normality of the data and the corresponding P-value (Fisher and Van Belle [1993])
Cramer-von Mises W-sq    Cramer-von Mises statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
Anderson-Darling A-sq    Anderson-Darling statistic for assessing the normality of the data and the associated P-value (Everitt [1998])
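The verbal definitions above can be written compactly. The following is a sketch in standard notation, where x_1, ..., x_n are the observations, x-bar is the sample mean and s is the sample standard deviation; these symbols are introduced here for illustration and do not appear in the SAS output.

```latex
\begin{align*}
\text{Uncorrected SS} &= \sum_{i=1}^{n} x_i^2 \\
\text{Corrected SS} &= \sum_{i=1}^{n} (x_i - \bar{x})^2 = \text{Uncorrected SS} - n\bar{x}^2 \\
s &= \sqrt{\frac{\text{Corrected SS}}{n-1}} \\
\text{Coeff Variation} &= 100 \cdot \frac{s}{\bar{x}} \\
\text{Std Error Mean} &= \frac{s}{\sqrt{n}} \\
t &= \frac{\bar{x}}{s/\sqrt{n}} \qquad \text{(Student's } t \text{ for a population mean of zero)}
\end{align*}
```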


OUTPUTS


proc gplot;

plot mortal*hardness;

run;


CORRELATION

proc corr data=water pearson spearman;


var mortal hardness;

run;


KERNEL DENSITY ESTIMATE

proc kde data=water out=bivest;

var mortal hardness;

run;

proc g3d data=bivest;

plot hardness*mortal=density;

run;

where KDE stands for kernel density estimate.

T-TEST

data water;

set water;

lhardnes=log(hardness);

if hardness < 100 then T = 1;

else T=2;

proc ttest;

class T;

var mortal hardness lhardnes;

proc npar1way wilcoxon;

class T;

var hardness;

run;


Example for application

The questionnaire from which these data were collected is given in the appendices.

The following data are part of the real data collected through the questionnaire.

data sasuser.book3;

input ser p1 p2 p3 p4 p5 q1 q2 q3 q4 q5

q6 q7 q8 q9 q10 q11 q12 q13 q14;

datalines;

1 2 2 2 3 3 4 4 2 2 4 3

3 3 3 4 4 3 3 3

2 1 2 2 2 1 4 5 4 4 3 2

1 5 3 4 4 4 3 1

3 1 2 1 3 3 4 4 4 5 4 4

4 4 4 4 4 4 4 4

4 1 3 1 3 4 5 5 5 5 5 5

5 5 5 5 5 5 5 5

5 1 2 2 2 3 4 4 4 4 4 4

4 4 4 5 4 4 4 5

6 1 2 2 2 1 2 2 1 2 3 2

1 2 1 2 3 2 3 2

7 1 2 2 3 1 3 2 2 3 3 2

3 2 3 3 3 3 2 3

8 1 1 1 3 1 2 2 1 1 2 2

2 2 2 1 1 2 1 1

9 2 3 2 3 4 1 2 2 2 1 2

1 2 2 2 1 1 2 3

10 1 3 2 3 1 4 3 3 2 4 3

2 5 5 2 4 3 5 4

11 1 3 2 3 1 5 5 4 3 4 4

4 4 5 5 5 4 4 4

12 1 3 2 3 1 5 4 4 4 4 4

4 4 4 3 3 3 2 2

13 1 3 2 3 2 4 4 4 4 3 3

3 4 4 4 4 4 4 4

14 2 3 1 2 4 3 4 4 3 2 4

1 5 3 2 3 3 3 2

15 2 2 2 1 2 4 4 4 3 3 2

3 3 3 3 3 4 3 4

16 2 2 1 1 2 4 2 5 3 2 2

3 5 3 4 4 1 3 2

17 2 2 2 2 2 4 3 3 2 3 3

4 4 4 4 5 5 5 3

18 2 2 4 2 1 2 1 4 1 2 2

2 1 2 2 3 3 2 1

19 2 3 2 2 4 4 4 4 4 3 4

3 4 4 5 4 4 4 5

20 2 3 1 2 4 4 4 4 3 4 3

2 4 3 4 4 4 4 3

;

run;

Alternatively, the same data can be imported from an Excel file:

proc import out=sasuser.book3

datafile="C:\Users\Mohd KHALAF\Desktop\SAS_Training\samples\book3.xls"

DBMS= Excel Replace;

GETNAMES=YES;

Run;

To see the file contents use the following procedure:

proc contents data=sasuser.book3;


run;

The output gives complete information about the dataset.

ANALYSIS OF EACH VARIABLE SEPARATELY

To analyze the previous data, we first describe each demographic variable on its own. The second stage describes the remaining questions, choosing the appropriate analysis to show how the sample responded to each item. We start by finding the frequencies and percentages for each demographic question, then cross-tabulate the demographic variables against one another and test whether the association is significant.

To obtain frequencies for all variables in the dataset, use the following command:

proc freq data=sasuser.book3;

run;

Output will be:

The variables in the second part of the questionnaire are better analyzed with other types of tests, and it makes no sense to compute frequencies or any other statistic for the serial number variable (ser). The frequencies are therefore requested for p1 to p5 only, using the following procedure:

proc freq data=sasuser.book3;

tables p1 p2 p3 p4 p5;

run;

The output will be as follow:


Through the previous tables, it is possible to describe the first five demographic

variables separately.

To make our output more readable, the variable labels and value labels should be

added. To add variable labels, the following program can be used:

data sasuser.book3;

set sasuser.book3;

label p1 = "Sex"

p2 = "Age"

p3 = "Educational level"

p4 = "Administrative level"

p5 = "Years of experience";


run;

proc contents data=sasuser.book3;

run;

The output of the contents procedure now shows the variable labels.

The frequencies for p1 and p2 can then be requested with the following procedure:

proc freq data=sasuser.book3;

tables p1 p2;

run;

The results will be as follow showing the labels:

To add value labels to the variables, define formats with proc format and apply them with a format statement:

proc format;
value p1f 1="Male"
          2="Female";
value p2f 1="Under 25 years"
          2="25 to under 35"
          3="35 to under 45"
          4="45 to under 55"
          5="55 and over";
value p3f 1="Intermediate diploma or less"
          2="Bachelor's degree"
          3="Higher diploma"
          4="Master's degree"
          5="Doctorate";
value p4f 1="Senior management"
          2="Middle management"
          3="Supervisory management";
value p5f 1="Under 5 years"
          2="5 to under 10 years"
          3="10 to under 15 years"
          4="15 years and over";
run;

proc freq data=sasuser.book3;
format p1 p1f.
       p2 p2f.
       p3 p3f.
       p4 p4f.
       p5 p5f.;
tables p1 p2 p3 p4 p5;
run;

The output will be follow:


To learn more about the demographic features of the studied sample, cross-tabulate the variables using the following procedure. The formats p1f to p5f defined with proc format earlier are reused here.

proc freq data=sasuser.book3;
format p1 p1f.
       p2 p2f.
       p3 p3f.
       p4 p4f.
       p5 p5f.;
tables p1*p2 p1*p3 p1*p4 p1*p5 / chisq;
run;

The output gives each cross-tabulation together with its chi-square test.
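With only 20 respondents, many cells of these cross-tabulations will have small expected counts, so the chi-square approximation may be unreliable. The sketch below is an optional variation, not part of the original analysis: it prints the expected counts and adds Fisher's exact test for one of the tables.

```sas
proc freq data=sasuser.book3;
  /* EXPECTED prints the expected cell counts; FISHER requests Fisher's exact test */
  tables p1*p2 / chisq expected fisher;
run;
```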


To analyze the second part of the questionnaire, descriptive statistics will be used

concentrating on the use of mean and standard deviation for the questions q1-q14; the

following procedure can be used:

proc means data=sasuser.book3;

run;

This gives the means for all variables in the dataset.

Since the first part has already been analyzed and should not be included, use the following procedure to get the means for the second part only:

proc means data=sasuser.book3;

var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14_;

run;

The output will be:

To limit the output to the mean or to other selected statistics, list them in the proc means statement:

proc means N mean std data=sasuser.book3;

var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14_;

run;


Suppose the variable q14 turns out to have been named q14_ by mistake. It can be renamed from q14_ to q14 using the following procedure:

data sasuser.book4;

set sasuser.book3;

rename q14_=q14;

run;

Output with many unnecessary decimal places is hard to read. To limit the number of decimal places, add the maxdec= option (here maxdec=2) to the proc means statement as follows:

proc means N mean std maxdec=2 data=sasuser.book4;

var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14;

run;

The output will be as follow:


For the means to be interpreted correctly, the agreement scale should run from (5) for strongly agree to (1) for strongly disagree. The means above are therefore misleading for q1 to q14, because the answers were coded in the opposite order. To correct the codes, recode (5 to 1), (4 to 2), (3 to 3), (2 to 4) and (1 to 5):

data sasuser.book3;

set sasuser.book3;

qq1 = 6- q1;

qq2 = 6- q2;

qq3 = 6- q3;

qq4 = 6- q4;

qq5 = 6- q5;

qq6 = 6- q6;

qq7 = 6- q7;

qq8 = 6- q8;

qq9 = 6- q9;

qq10 = 6- q10;

qq11 = 6- q11;

qq12 = 6- q12;

qq13 = 6- q13;

qq14 = 6- q14;

run;
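The fourteen assignments above all apply the same rule, so they can be written more compactly with arrays. This is a sketch only: the output data set name work.recoded and the array names item and rev are illustrative, and it assumes the input data set holds the variables q1 to q14 under exactly those names.

```sas
data work.recoded;              /* hypothetical output data set */
  set sasuser.book3;
  array item{14} q1-q14;        /* original answers */
  array rev{14}  qq1-qq14;      /* reversed answers */
  do i = 1 to 14;
    rev{i} = 6 - item{i};       /* 5 becomes 1, 4 becomes 2, ..., 1 becomes 5 */
  end;
  drop i;
run;
```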


proc freq data=sasuser.book3;

tables q1-q9;

run;

For a complete and comprehensive analysis, the results for questions q1 to q14 should be broken down by the available demographic variables. The procedure is as follows:

proc sort data=sasuser.book4;

by p1;

run;


proc means data=sasuser.book4;

var q1;

by p1;

run;

The output for the distribution will be:


The previous analysis can be done for all variables q1-q14 in one step as follow:

proc sort data=sasuser.book4;

by p1;

run;


proc means data=sasuser.book4;

var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14;

by p1;

run;

The output will be:


The same analysis will be conducted with p2 up to p3.

T-TEST

One sample t test

Three types of t-test can be applied to the running example. The first is the one-sample t-test.

proc ttest h0=3 alpha=0.1 data=sasuser.book4;

var q1;

run;

The output will be:


The same test can be repeated for q2 to q14 to check whether each variable's mean differs significantly from the hypothesized mean:

proc ttest h0=3 alpha=0.1 data=sasuser.book4;

var q1-q14;

run;

Paired t-test

The second type of t-test that can be applied to the current questionnaire is the paired

t-test.

proc ttest data=sasuser.book4;

paired q1*q4 q1*q2;

run;


CORRELATION TEST

A correlation test is used to find out whether two variables are correlated with each other. The correlation procedure is:

proc corr data=sasuser.book4;

var q1 q2;

run;

The output will be:


The result shows that there is a high correlation between the two variables, which matches the paired-sample t-test.

Independent sample t-test

The third type of t-test is the independent-samples t-test. This test involves two variables: the first should be categorical with two levels, and the second should be continuous. The test can therefore be run for sex (p1) against q1 to q14. The procedure is as follows:

proc ttest data=sasuser.book4;

class p1;

var q1-q14;

run;


ANOVA

To test the effect of educational level, employee position and experience on the

attitudes for the questions q1 to q14, the analysis of variance will be used. The

procedure applied in the analysis of variance will be:

proc anova data=sasuser.book4;

class p2;

model q1 = p2;

run;

In general, however, the tests should not be run separately for each question. The questionnaire in Appendix I shows that q1-q7 represent one field, while q8-q14 represent another field in the survey. The mean for each field should therefore be calculated in a new variable, which is then tested against the demographic variables. This can be done as follows:

data sasuser.book4;

set sasuser.book4;

q = (q1 + q2 + q3 + q4 + q5 + q6 + q7)/7;

run;

and for the q8 to q14:

data sasuser.book4;

set sasuser.book4;

qq = (q8 + q9 + q10 + q11 + q12 + q13 + q14)/7;

run;

Then the ANOVA analysis can be handled with q and qq with the demographic

variables.
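As a sketch of that last step, the two field scores can be computed in one data step and passed to a single Proc Anova call. The data set name work.scores is illustrative, and p3 (educational level) is chosen only as an example of a demographic variable. The mean function averages the answers that are present, so a respondent with a missing item still gets a score, unlike the sum-and-divide form above.

```sas
data work.scores;                 /* hypothetical data set name */
  set sasuser.book4;
  q  = mean(of q1-q7);            /* first field: items q1 to q7 */
  qq = mean(of q8-q14);           /* second field: items q8 to q14 */
run;

proc anova data=work.scores;
  class p3;                       /* educational level */
  model q qq = p3;                /* one analysis per dependent variable */
run;
quit;
```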


REGRESSION ANALYSIS

To measure the effect of q on qq, regression analysis can be used. The q variable is the independent variable and qq is the dependent variable. The procedure is as follows:

proc reg data=sasuser.book4;

model qq=q;

run;


ODS IN SAS

The Output Delivery System (ODS) provides a way to manage SAS output. The output can be directed to files that other software can open, in Rich Text Format (RTF), HTML, or other formats. The following statements send the output to an RTF file:

ODS RTF;

proc means N mean std maxdec=2 data=sasuser.book4;

var q1 q2 q3 q4 q5 q6 q7 q8 q9 q10 q11 q12 q13 q14;

run;

ODS RTF Close;

The output is written to a Rich Text Format file that can be opened in Microsoft Office, and it is also displayed inside SAS as usual.
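To control where the file is written, a file= option can be added to the ODS statement, and more than one destination can be open at the same time. The following is a minimal sketch; the file names means_q.rtf and means_q.pdf are placeholders, not names used in the original.

```sas
/* Placeholder file names; without a path they are written to the current SAS working directory */
ods rtf file="means_q.rtf";
ods pdf file="means_q.pdf";

proc means N mean std maxdec=2 data=sasuser.book4;
  var q1-q14;
run;

/* Closing each destination finishes writing its file */
ods pdf close;
ods rtf close;
```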


APPENDICES


Questionnaire

In the name of God, the Most Gracious, the Most Merciful

Dear Sir/Madam,

Greetings,

This questionnaire aims to measure the effect of training on improving employee performance. The study aims to offer proposals for improving the training programs adopted by your esteemed institution, so as to raise its performance in a way that serves the achievement of your goals.

Your care in providing the requested data and information accurately and objectively will contribute effectively to reaching more accurate results and to making more useful recommendations. We therefore kindly ask you to mark the items of the attached form as they apply to the training programs used at your center.

Please note that the data and information you provide for this study will be used for scientific research purposes only and will be treated with complete confidentiality. You will be provided with the results of the study once it is completed, should you wish to see them.

Thank you for your kind cooperation.

Part One: Please place an (X) in the appropriate place.

Personal and job characteristics

- Sex: ( ) Male ( ) Female
- Age: ( ) Under 25 years ( ) 25 to under 35 ( ) 35 to under 45 ( ) 45 to under 55 ( ) 55 and over
- Educational level: ( ) Intermediate diploma or less ( ) Bachelor's degree ( ) Higher diploma ( ) Master's degree ( ) Doctorate
- Administrative level: ( ) Senior management ( ) Middle management ( ) Supervisory management
- Years of experience: ( ) Under 5 years ( ) 5 to under 10 years ( ) 10 to under 15 years ( ) 15 years and over

Part Two: Please place an (x) in the place you consider appropriate.

Independent variables (training)

Response scale for each item: Strongly agree, Agree, Not sure, Disagree, Strongly disagree

First: Identifying training needs

1- Identifying training needs has contributed effectively to the success of the training process.
2- The objectives of the training program are stated clearly and precisely.
3- Training programs are selected according to the needs of the employees and of the work.
4- Training programs are designed using a scientific methodology.
5- Management is keen to identify employees' training needs in order to improve their level of performance.
6- Management is keen to identify areas of weak performance among employees.
7- Management looks for the causes of errors in performance and works to eliminate them.

Second: Efficiency of training programs

1- The training programs delivered are among the best means of improving work performance.
2- Training has helped reduce job-related burdens within the department.
3- The use of advanced scientific methods has helped increase the efficiency of training programs.
4- Training programs have helped employees acquire skills and knowledge that were applied in the institution.
5- Training programs have helped employees develop a competitive spirit.
6- Training programs give employees the opportunity for hands-on practice.
7- The company contributes by preparing training programs to create outstanding staff.


SAS t-test Commands

This handout illustrates how to read in raw data to SAS, set up missing values and create

new variables using transformations and recodes. We illustrate independent samples t-tests,

paired t-tests, and one-sample t-tests.

Read in Raw Data

In the first data step, we read in the raw data using an infile and input statement. We don't

need to tell SAS the column location of each variable, because there is at least one blank

between variables, so we can use a free-format input statement where the variables are

simply listed in the order they appear in the raw data file.

/*Read in the raw data*/

data owen;

infile "owen.dat" ;

input family child age sex race w_rank income_c height weight hemo

vit_c vit_a head_cir fatfold b_weight mot_age b_order

m_height

f_height ;

run;

Create a Permanent Dataset

After reading in the raw data, we create a new permanent SAS dataset in which we set up

missing values and create new variables using recodes and transformations. Note in setting

up the missing value codes, a dot (.) is used for the missing value code and no quotes are

employed, because all of these variables are numeric. Although we used two data steps in

this example, all of this code could have been accomplished in a single data step.

libname b510 "c:\documents and settings\kwelch\desktop\b510";

data b510.owen;

set owen;

if height = 999 then height = .;

if weight = 999 then weight = .;


if vit_a = 99 then vit_a = .;

if head_cir = 99 then head_cir = .;

if fatfold = 99 then fatfold = .;

if b_weight = 999 then b_weight= .;

if mot_age = 99 then mot_age = .;

if b_order = 99 then b_order = .;

if m_height = 999 then m_height=.;

if f_height = 999 then f_height=.;

bwt_g = b_weight*10;

if bwt_g not=. and bwt_g < 2500 then lowbwt=1;

if bwt_g >=2500 then lowbwt=0;

log_fatfold = log(fatfold);

htdiff = f_height - m_height;

bmi = weight /(height/100)**2;

run;
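As noted above, the two data steps could be combined into one. The sketch below shows one way this might look, using arrays to apply the missing-value codes instead of one if statement per variable. The data set name work.owen_onestep and the array names miss999 and miss99 are illustrative choices, so that b510.owen is not overwritten.

```sas
data work.owen_onestep;          /* hypothetical name */
  infile "owen.dat";
  input family child age sex race w_rank income_c height weight hemo
        vit_c vit_a head_cir fatfold b_weight mot_age b_order
        m_height f_height;

  /* Variables whose missing-value code is 999 */
  array miss999{*} height weight b_weight m_height f_height;
  /* Variables whose missing-value code is 99 */
  array miss99{*} vit_a head_cir fatfold mot_age b_order;

  do i = 1 to dim(miss999);
    if miss999{i} = 999 then miss999{i} = .;
  end;
  do i = 1 to dim(miss99);
    if miss99{i} = 99 then miss99{i} = .;
  end;
  drop i;

  /* Same derived variables as in the two-step version */
  bwt_g = b_weight*10;
  if bwt_g not=. and bwt_g < 2500 then lowbwt=1;
  if bwt_g >=2500 then lowbwt=0;
  log_fatfold = log(fatfold);
  htdiff = f_height - m_height;
  bmi = weight /(height/100)**2;
run;
```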

Basic Descriptive Statistics

It is always good practice to check a dataset after you have created it. Proc Means is useful

for numeric variables. Be especially attentive to the number of observations (N) and the

minimum and maximum value for each variable. Check to see that they are reasonable.

/*Simple Descriptive Statistics on all Numeric Variables*/

proc means data=b510.owen;

run;

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

-----------------------------------------------------------------------------------

family 1006 4525.11 1634.03 2000.00 7569.00

child 1006 1.3359841 0.5716672 1.0000000 3.0000000

age 1006 44.0248509 16.6610452 12.0000000 73.0000000

sex 1006 1.4890656 0.5001291 1.0000000 2.0000000

race 1006 1.2823062 0.4503454 1.0000000 2.0000000

w_rank 1006 2.2127237 0.9024440 1.0000000 4.0000000

income_c 1006 1581.31 974.2279710 80.0000000 6250.00

height 1001 99.0429570 11.4300111 70.0000000 130.0000000

weight 1000 15.6290800 3.6523446 8.2400000 41.0800000

hemo 1006 12.4606362 1.1578850 6.2000000 24.1000000

vit_c 1006 1.1302187 0.6599121 0.1000000 3.5000000

vit_a 763 36.0380079 8.8951237 15.0000000 78.0000000

head_cir 999 49.3763764 2.0739057 39.0000000 56.0000000

fatfold 993 4.4562941 1.6683194 2.6000000 42.0000000

b_weight 986 325.0517241 59.5162936 91.0000000 544.0000000

mot_age 981 29.2660550 6.2603025 17.0000000 51.0000000

b_order 980 2.9479592 2.1939526 1.0000000 16.0000000

m_height 980 163.7632653 6.3663343 122.0000000 199.0000000

f_height 975 178.2194872 7.3821354 152.0000000 210.0000000

bwt_g 986 3250.52 595.1629357 910.0000000 5440.00

lowbwt 986 0.1075051 0.3099115 0 1.0000000

log_fatfold 993 1.4599658 0.2396859 0.9555114 3.7376696

htdiff 972 14.4218107 8.7834139 -12.0000000 56.0000000

bmi 998 15.8124399 1.6634700 11.0247934 26.2912000

-----------------------------------------------------------------------------------


Descriptives for Subgroups using a Class Statement

A Class statement can be used with Proc Means to get descriptive statistics for subgroups of

cases. You don't have to sort the data when using a class statement.


proc means data=b510.owen;

class sex;

var bwt_g bmi fatfold log_fatfold;

run;

The MEANS Procedure

       N
SEX   Obs   Variable      Label      N     Mean         Std Dev       Minimum       Maximum
----------------------------------------------------------------------------------------------
1     514   bwt_g                    497   3340.56      565.3268435   1360.00       5170.00
            bmi                      510   15.8982386   1.6074313     11.3795135    26.2912000
            FATFOLD       FATFOLD    507   4.2518738    0.9720458     2.6000000     10.2000000
            log_fatfold              507   1.4247028    0.2076417     0.9555114     2.3223877
2     492   bwt_g                    489   3159.00      611.1350784   910.0000000   5440.00
            bmi                      488   15.7227732   1.7171565     11.0247934    24.4485835
            FATFOLD       FATFOLD    486   4.6695473    2.1489049     2.6000000     42.0000000
            log_fatfold              486   1.4967524    0.2643232     0.9555114     3.7376696
----------------------------------------------------------------------------------------------

Descriptives for Subgroups using a By Statement

A By statement is another way to get information for subgroups of cases. You need to sort

the data first when using a By statement. The By statement is more generally applicable than

the Class statement and can be used with most SAS procedures (e.g. Proc Reg, Proc Freq). To

avoid too much output, use a By statement only for variables that have a limited number of

levels.


proc sort data=b510.owen;

by sex;

run;


proc means data=b510.owen;

by sex;

var bwt_g bmi fatfold log_fatfold;

run;

-------------------------------------------- SEX=1 -----------------------------------

The MEANS Procedure

Variable      Label      N     Mean         Std Dev       Minimum       Maximum
--------------------------------------------------------------------------------------
bwt_g                    497   3340.56      565.3268435   1360.00       5170.00
bmi                      510   15.8982386   1.6074313     11.3795135    26.2912000
FATFOLD       FATFOLD    507   4.2518738    0.9720458     2.6000000     10.2000000
log_fatfold              507   1.4247028    0.2076417     0.9555114     2.3223877
--------------------------------------------------------------------------------------

-------------------------------------------- SEX=2 -----------------------------------

Variable      Label      N     Mean         Std Dev       Minimum       Maximum
--------------------------------------------------------------------------------------
bwt_g                    489   3159.00      611.1350784   910.0000000   5440.00
bmi                      488   15.7227732   1.7171565     11.0247934    24.4485835
FATFOLD       FATFOLD    486   4.6695473    2.1489049     2.6000000     42.0000000
log_fatfold              486   1.4967524    0.2643232     0.9555114     3.7376696
--------------------------------------------------------------------------------------

Boxplots

Boxplots are a nice way to visualize data when you wish to compare the value of a

continuous variable for two or more groups. In SAS 9.2, you can use Proc Sgplot to get

boxplots. Proc Boxplot can be used in earlier versions of SAS, and in SAS 9.2.

/*Boxplots*/

proc sgplot data=b510.owen;

vbox bwt_g / category=sex;

run;


proc sgplot data=b510.owen;

vbox bmi / category=sex;

run;

proc sgplot data=b510.owen;

vbox fatfold / category=sex;

run;

proc sgplot data=b510.owen;

vbox log_fatfold / category=sex;

run;

The boxplots show the median, upper and lower quartiles, give an idea of skewness, and

indicate outliers.
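For releases without Proc Sgplot, the text mentions Proc Boxplot. A minimal sketch of the equivalent call for birthweight follows. Proc Boxplot expects the data to be ordered by the grouping variable, so the data are sorted first; the data set name work.owen_bysex is illustrative.

```sas
/* Sort a copy by the grouping variable before calling Proc Boxplot */
proc sort data=b510.owen out=work.owen_bysex;   /* hypothetical name */
  by sex;
run;

proc boxplot data=work.owen_bysex;
  plot bwt_g*sex;      /* analysis variable * group variable */
run;
```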


Independent Samples t-test

An independent samples t-test can be used to compare the mean of a continuous variable (e.g., birthweight) for two groups of cases. In this example, we are comparing the means of BWT_G, BMI, and LOG_FATFOLD for females vs. males. Notice that Proc ttest uses a class statement for an independent samples t-test; no sorting of the data is necessary.

The assumptions for the t-test are that the observations are independent (i.e., the

values of individuals are not correlated), that the underlying distribution of the

continuous variable is normal within the two groups, and that the variances in the two

groups are equal. The t-test is robust to departures from the normality assumption, if

the sample size is large (e.g. 50 or more cases). The equality of variances is a more

important assumption. SAS gives a test of equality of variances at the bottom of the t-

test output. If equality of variances is a reasonable assumption, the F-test for equality

of variances will not be significant. We often use a somewhat higher alpha level than

usual for this equality of variances test (e.g., p>.10) to be more conservative (i.e., we


don't want to wrongly assume equal variances, when in fact they are unequal). SAS

produces two different t-test results, the first one assumes equality of variances and

the second one does not. You can choose the test to use based on the results of the

equality of variances test. By default, SAS always reports a two-sided p-value for the

t-test.

proc ttest data=b510.owen;

class sex;

var bwt_g bmi log_fatfold;

run;

Variable: bwt_g

SEX          N      Mean     Std Dev   Std Err   Minimum   Maximum
1            497    3340.6   565.3     25.3584   1360.0    5170.0
2            489    3159.0   611.1     27.6365   910.0     5440.0
Diff (1-2)          181.6    588.5     37.4840

SEX          Method          Mean     95% CL Mean        Std Dev   95% CL Std Dev
1                            3340.6   3290.7   3390.4    565.3     532.2   602.8
2                            3159.0   3104.7   3213.3    611.1     575.1   652.0
Diff (1-2)   Pooled          181.6    108.0    255.1     588.5     563.6   615.7
Diff (1-2)   Satterthwaite   181.6    108.0    255.2

Method   Variances   DF    t Value   Pr > |t|
Pooled   Equal       984   4.84
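The pooled t value reported for bwt_g can be reproduced from the summary statistics in the output above. The notation below (n for group sizes, s for group standard deviations, s_p for the pooled standard deviation) is introduced here for illustration; the numbers are taken from the bwt_g tables and rounded.

```latex
\begin{align*}
s_p &= \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}
     = \sqrt{\frac{496\,(565.3)^2 + 488\,(611.1)^2}{984}} \approx 588.5 \\
\mathrm{SE}(\bar{x}_1-\bar{x}_2) &= s_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}
     = 588.5\sqrt{\frac{1}{497}+\frac{1}{489}} \approx 37.48 \\
t &= \frac{\bar{x}_1-\bar{x}_2}{\mathrm{SE}(\bar{x}_1-\bar{x}_2)}
     = \frac{181.6}{37.48} \approx 4.84, \qquad \mathrm{df} = n_1+n_2-2 = 984
\end{align*}
```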


Variable: bmi

SEX          N      Mean      Std Dev   Std Err   Minimum   Maximum
1            510    15.8982   1.6074    0.0712    11.3795   26.2912
2            488    15.7228   1.7172    0.0777    11.0248   24.4486
Diff (1-2)          0.1755    1.6620    0.1052

SEX          Method          Mean      95% CL Mean          Std Dev   95% CL Std Dev
1                            15.8982   15.7584   16.0381    1.6074    1.5145   1.7126
2                            15.7228   15.5700   15.8755    1.7172    1.6158   1.8322
Diff (1-2)   Pooled          0.1755    -0.0311   0.3820     1.6620    1.5921   1.7383
Diff (1-2)   Satterthwaite   0.1755    -0.0314   0.3823

Method          Variances   DF      t Value   Pr > |t|
Pooled          Equal       996     1.67      0.0958
Satterthwaite   Unequal     984.1   1.66      0.0963

Equality of Variances

Method     Num DF   Den DF   F Value   Pr > F
Folded F   487      509      1.14      0.1407

Variable: log_fatfold

SEX          N      Mean      Std Dev   Std Err   Minimum   Maximum
1            507    1.4247    0.2076    0.00922   0.9555    2.3224
2            486    1.4968    0.2643    0.0120    0.9555    3.7377
Diff (1-2)          -0.0720   0.2371    0.0151

SEX          Method          Mean      95% CL Mean          Std Dev   95% CL Std Dev
1                            1.4247    1.4066    1.4428     0.2076    0.1956   0.2213
2                            1.4968    1.4732    1.5203     0.2643    0.2487   0.2821
Diff (1-2)   Pooled          -0.0720   -0.1016   -0.0425    0.2371    0.2271   0.2480
Diff (1-2)   Satterthwaite   -0.0720   -0.1017   -0.0424

Method   Variances   DF    t Value   Pr > |t|
Pooled   Equal       991   -4.79

N      Mean      Std Dev   Std Err   Minimum    Maximum
972    14.4218   8.7834    0.2817    -12.0000   56.0000

Mean      95% CL Mean          Std Dev   95% CL Std Dev
14.4218   13.8689   14.9747    8.7834    8.4096   9.1923

DF    t Value   Pr > |t|
971   51.19

Mean      95% CL Mean          Std Dev   95% CL Std Dev
14.4352   13.6374   15.2331    9.0257    8.4958   9.6266

DF    t Value   Pr > |t|
493   35.55

DF    t Value   Pr > |t|
477   36.91

proc ttest data=b510.owen;

var htdiff;

run;

The TTEST Procedure

Variable: htdiff


N      Mean      Std Dev   Std Err   Minimum    Maximum
972    14.4218   8.7834    0.2817    -12.0000   56.0000

Mean      95% CL Mean          Std Dev   95% CL Std Dev
14.4218   13.8689   14.9747    8.7834    8.4096   9.1923

DF    t Value   Pr > |t|
971   51.19

DF    t Value   Pr > |t|
971   -2.05     0.0404


One-sample t-test using Proc Univariate

Proc Univariate can also be used to carry out a one-sample t-test, to get more

information about the distribution of a variable, and to look at a histogram of the

distribution of the variable.

proc univariate data=b510.owen;

var htdiff;

histogram / normal;

run;

The UNIVARIATE Procedure

Variable: htdiff

Moments

N 972 Sum Weights 972

Mean 14.4218107 Sum Observations 14018

Std Deviation 8.78341392 Variance 77.1483601

Skewness 0.31703251 Kurtosis 0.56094005

Uncorrected SS 277076 Corrected SS 74911.0576

Coeff Variation 60.9036833 Std Error Mean 0.28172813

Basic Statistical Measures

Location Variability

Mean 14.42181 Std Deviation 8.78341

Median 15.00000 Variance 77.14836

Mode 15.00000 Range 68.00000

Interquartile Range 12.00000


Tests for Location: Mu0=0

Test          -Statistic-     -----p Value------
Student's t   t  51.19052     Pr > |t|
Sign          M               Pr >= |M|
Signed Rank   S               Pr >= |S|


Missing Values

-----Percent Of-----

Missing Missing

Value Count All Obs Obs

. 34 3.38 100.00

Fitted Normal Distribution for htdiff

Parameters for Normal Distribution

Parameter Symbol Estimate

Mean Mu 14.42181

Std Dev Sigma 8.783414

Goodness-of-Fit Tests for Normal Distribution

Test ----Statistic----- ------p Value------

Kolmogorov-Smirnov D 0.07149425 Pr > D <0.010

Quantiles for Normal Distribution

Percent Observed Estimated

10.0 3.0000 3.16541

25.0 8.0000 8.49749

50.0 15.0000 14.42181

75.0 20.0000 20.34613

90.0 25.0000 25.67821

95.0 29.0000 28.86924

99.0 37.0000 34.85509

One-sample t-test using Proc Univariate with a specified null hypothesis value for the mean

Proc Univariate also lets us specify the null hypothesis value for the mean, through the mu0 option on the Proc Univariate statement. Here we test whether the mean of htdiff differs from 15.

proc univariate data=b510.owen mu0=15;

var htdiff;

run;

Tests for Location: Mu0=15

Test -Statistic- -----p Value------

Student's t t -2.0523 Pr > |t| 0.0404

Sign M -40 Pr >= |M| 0.0071

Signed Rank S -18300 Pr >= |S| 0.0121
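The Student's t statistic can be reproduced by hand from the Moments table, since t = (mean - mu0) / (std deviation / sqrt(N)). The data step below is a small sketch of that arithmetic using the values printed earlier; the variable names are illustrative only.

```sas
data _null_;
   n    = 972;            /* N from the Moments table             */
   mean = 14.4218107;     /* Mean from the Moments table          */
   sd   = 8.78341392;     /* Std Deviation from the Moments table */
   mu0  = 15;             /* null hypothesis value                */
   se = sd / sqrt(n);                   /* standard error of the mean */
   t  = (mean - mu0) / se;              /* one-sample t statistic     */
   df = n - 1;                          /* degrees of freedom         */
   p  = 2 * (1 - probt(abs(t), df));    /* two-sided p-value          */
   put se= t= df= p=;                   /* write the results to the log */
run;
```

The values written to the log should agree, up to rounding, with the Std Error Mean (0.2817), the t statistic (-2.0523) and Pr > |t| (0.0404) reported by the procedure.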


SAS Simple Linear Regression Example

This handout gives examples of how to use SAS to generate a simple linear regression plot, check the correlation

between two variables, fit a simple linear regression model, check the residuals from the model, and also shows

some of the ODS (Output Delivery System) output in SAS.

Read in Raw Data

We first read in the raw data from the werner2.dat raw dataset, and set up the missing value codes using a data

step, and then check descriptive statistics for the numeric variables, using Proc Means.

OPTIONS FORMCHAR="|----|+|---+=|-/\*";

libname b510 "C:\Users\kwelch\Desktop\B510";

DATA b510.werner;

INFILE "C:\Users\kwelch\Desktop\B510\werner2.dat";

INPUT ID 1-4 AGE 5-8 HT 9-12 WT 13-16

PILL 17-20 CHOL 21-24 ALB 25-28 1

CALC 29-32 1 URIC 33-36 1;

IF HT = 999 THEN HT = .;

IF WT = 999 THEN WT = .;

IF CHOL = 600 THEN CHOL = .;

IF ALB = 99 THEN ALB = .;

IF CALC = 99 THEN CALC = .;

IF URIC = 99 THEN URIC = .;

run;

/*Check the Data*/

title "DESCRIPTIVE STATISTICS";

proc means data=b510.werner;

run;


DESCRIPTIVE STATISTICS

The MEANS Procedure

Variable N Mean Std Dev Minimum Maximum

-------------------------------------------------------------------------------

ID 188 1598.96 1057.09 3.0000000 3519.00

AGE 188 33.8191489 10.1126942 19.0000000 55.0000000

HT 186 64.5107527 2.4850673 57.0000000 71.0000000

WT 186 131.6720430 20.6605767 94.0000000 215.0000000

PILL 188 1.5000000 0.5013351 1.0000000 2.0000000

CHOL 187 235.1550802 44.5706219 50.0000000 390.0000000

ALB 186 4.1112903 0.3579694 3.2000000 5.0000000

CALC 185 9.9621622 0.4795556 8.6000000 11.1000000

URIC 187 4.7705882 1.1572312 2.2000000 9.9000000

-------------------------------------------------------------------------------

Correlation

We now check the correlation between the response (or dependent) variable, CHOL, and the predictor (or independent) variable, AGE, using Proc Corr. It is positive and significant (r = .369, p < .0001).
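A minimal sketch of a Proc Corr call that would produce simple statistics and a Pearson correlation table for these two variables. The title text is an assumption, and the original call is not shown in this handout as extracted.

```sas
/* Pearson correlation between AGE and CHOL.                  */
/* Proc Corr prints simple statistics, then r, its p-value,   */
/* and the number of observations used for each pair.         */
title "Correlation of AGE and CHOL";   /* illustrative title */
proc corr data=b510.werner;
   var age chol;
run;
```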


Variable N Mean Std Dev Sum Minimum Maximum

AGE 188 33.81915 10.11269 6358 19.00000 55.00000

CHOL 187 235.15508 44.57062 43974 50.00000 390.00000

Pearson Correlation Coefficients

Prob > |r| under H0: Rho=0

Number of Observations

            AGE        CHOL

AGE     1.00000     0.36923
                     <.0001
            188         187

CHOL    0.36923     1.00000
         <.0001
            187         187


Simple Linear Regression

We now fit a linear regression model, with CHOL as the Y (dependent or outcome) variable and AGE as the X

(independent or predictor) variable, using Proc Reg. We first illustrate the most basic Proc Reg syntax, and then

show some useful options. The Quit statement is used to tell SAS that there are no more statements coming for

this run of Proc Reg.

The output shows that there is a positive relationship between these two variables. When age increases by one year, average cholesterol is predicted to increase by about 1.63 units, and this relationship is significant (t(185) = 5.40, p < .0001).
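A minimal sketch of the most basic Proc Reg syntax described here, with no options on the Model statement. The title is taken from the output heading that follows; the exact original call is assumed rather than shown.

```sas
/* Basic simple linear regression: CHOL (Y) on AGE (X). */
title "Simple Linear Regression Model with no options";
proc reg data=b510.werner;
   model chol = age;
run; quit;   /* Quit ends this run of Proc Reg */
```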


Simple Linear Regression Model with no options

The REG Procedure

Model: MODEL1

Dependent Variable: CHOL

Number of Observations Read 188

Number of Observations Used 187

Number of Observations with Missing Values 1

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F

Model 1 50373 50373 29.20 <.0001

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t|

Intercept 1 179.96174 10.65564 16.89 <.0001


We now include some diagnostic plots using Proc Reg. We also generate a new dataset called OUTREG1 that contains all of the original variables, plus the predicted value for each observation (PREDICT), the residual (RESID), the studentized-deleted residual (RSTUD), and Cook's Distance (COOKD).

ods graphics on;

title "Simple Linear Regression with Diagnostic Plots";

proc reg DATA=B510.werner;

MODEL CHOL=AGE / stb clb;

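/* Save predicted values, residuals, studentized-deleted residuals, Cook's D, */
/* and the 95% limits for individual predictions (LCL, UCL) and for the mean (LCLM, UCLM). */
OUTPUT OUT=OUTREG1 P=PREDICT R=RESID RSTUDENT=RSTUD COOKD=COOKD
       LCL=LCL UCL=UCL LCLM=LCLM UCLM=UCLM;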

run;quit;

ods graphics off;

The partial output below shows the standardized estimate (requested with the STB option), which is the estimated change in Y, in standard deviation units, when X increases by one standard deviation. This estimate is 0.369. The CLB option adds the 95% confidence limits for the parameter estimate, which run from 1.03 to 2.22.
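One way to see what the standardized estimate means is to standardize both variables to mean 0 and standard deviation 1 and refit the model. The slope of that fit should match the STB value, and in simple linear regression it also equals the Pearson correlation between the two variables. The sketch below makes that assumption explicit; WERNER_Z is a hypothetical work dataset name, and the Where statement restricts the standardization to the cases the regression actually uses.

```sas
/* Standardize CHOL and AGE to mean 0, SD 1, using complete cases only. */
proc standard data=b510.werner mean=0 std=1 out=werner_z;
   where not missing(chol) and not missing(age);
   var chol age;
run;

/* Refit on the standardized variables: the AGE slope is the */
/* standardized estimate, and the intercept is essentially 0. */
proc reg data=werner_z;
   model chol = age;
run; quit;
```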

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t| Standardized Estimate

Intercept 1 179.96174 10.65564 16.89 <.0001


The diagnostic panel shows a series of diagnostic plots for this regression model.

The residual plot below shows a scatterplot with the residuals on the Y-axis and AGE on the X-axis. We want to

look for a lack of pattern in these residuals. We can see that there is one low outlier, at about age 25.
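A residual plot of this kind can also be drawn directly from the output dataset. The following is a sketch using Proc Sgplot, assuming OUTREG1 contains the RESID variable created by the Output statement; the title and the reference line at zero are illustrative additions.

```sas
/* Residuals (Y-axis) against AGE (X-axis), with a horizontal line at 0. */
title "Residuals by AGE";   /* illustrative title */
proc sgplot data=outreg1;
   scatter x=age y=resid;
   refline 0 / axis=y;
run;
```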


The fit plot shown below shows the regression model fit, and summarizes some of the statistics for the model.

Check the output dataset


We now check the output dataset, using Proc Print. We also request that Proc Print display the labels for each variable, by using the Label option. Using a Where statement, we print selected variables for those observations whose studentized-deleted residual is greater than or equal to 3 in absolute value.

title "Partial Listing of Output Dataset";

proc print data=outreg1;

where abs(rstud) >=3;

VAR ID AGE CHOL PREDICT RESID RSTUD COOKD LCL UCL LCLM UCLM;

run;

Partial Listing of Output Dataset

Obs ID AGE CHOL PREDICT RESID RSTUD COOKD LCL UCL LCLM UCLM

4 1797 25 50 220.686 -170.686 -4.32214 0.081802 138.358 303.014 212.698 228.674

182 3134 50 390 261.410 128.590 3.20326 0.094792 178.695 344.126 250.106 272.714
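The two listed rows can be checked by hand: the residual is the observed value minus the predicted value, and the predicted value lies midway between each pair of limits. The data step below is a sketch of that check using the numbers from the listing; the names RESID_CHECK, MID_PRED and MID_MEAN are illustrative only.

```sas
data _null_;
   input id age chol predict resid lcl ucl lclm uclm;
   resid_check = chol - predict;     /* residual = observed - predicted            */
   mid_pred    = (lcl + ucl) / 2;    /* midpoint of limits for an individual value */
   mid_mean    = (lclm + uclm) / 2;  /* midpoint of limits for the mean            */
   put id= resid= resid_check= predict= mid_pred= mid_mean=;
   datalines;
1797 25  50 220.686 -170.686 138.358 303.014 212.698 228.674
3134 50 390 261.410  128.590 178.695 344.126 250.106 272.714
;
run;
```

For each row, RESID_CHECK should match RESID, and both midpoints should match PREDICT up to rounding in the last digit.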

Check the residuals for normality

We now check the studentized residuals for normality, using Proc Univariate. This is similar to the output from

the ODS graphics that was shown in the earlier panel.

title "Checking Residuals for Normality";

proc univariate data=outreg1 PLOT NORMAL;

var rstud;

histogram / normal;

qqplot / normal(mu=est sigma=est);

run;

The residuals appear to be fairly normally distributed, but there is at least one very low outlier, which we

identified earlier, when we checked the values in the output dataset.


Refit the regression model without the cases in question

We now refit the model without the two outliers, excluding them with a Where statement.

ods graphics on;

title "Rerun the model without two obs";

proc reg data=b510.WERNER;

where id not in (1797, 3134);

model chol=age;

run;quit;

ods graphics off;

We can see the changes in the parameter estimates from the output below.

Figure: Checking Residuals for Normality. Histogram of Studentized Residual without Current Obs (x-axis), with Percent on the y-axis.

Figure: Checking Residuals for Normality. Normal quantile plot of Studentized Residual without Current Obs (y-axis) against Normal Quantiles (x-axis).


Dependent Variable: CHOL

Number of Observations Read 186

Number of Observations Used 185

Number of Observations with Missing Values 1

Analysis of Variance

Source DF Sum of Squares Mean Square F Value Pr > F

Model 1 38478 38478 25.82 <.0001

Parameter Estimates

Variable DF Parameter Estimate Standard Error t Value Pr > |t|

Intercept 1 186.70039 9.98091 18.71 <.0001
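To compare the two fits side by side, the parameter estimates from each run can be captured with ODS Output and stacked. This is only a sketch: the dataset names PE_ALL, PE_TRIM and PE_COMPARE and the variable FIT are hypothetical, and it assumes the Proc Reg ODS table is named ParameterEstimates.

```sas
/* Capture the parameter estimates from the full model. */
ods output ParameterEstimates=pe_all;
proc reg data=b510.werner;
   model chol = age;
run; quit;

/* Capture the parameter estimates from the model without the two cases. */
ods output ParameterEstimates=pe_trim;
proc reg data=b510.werner;
   where id not in (1797, 3134);
   model chol = age;
run; quit;

/* Stack the two tables and label which fit each row came from. */
data pe_compare;
   length fit $ 12;
   set pe_all(in=from_all) pe_trim;
   if from_all then fit = "All cases";
   else fit = "Trimmed";
run;

proc print data=pe_compare noobs;
   var fit variable estimate stderr tvalue probt;
run;
```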
