chapter 2 - describing data 1.summary statistics - proc ... · chapter 2 - describing data...


Upload: phamcong

Post on 22-Apr-2018


Chapter 2 - Describing Data

1. Summary Statistics - Proc Means

(a) Var

(b) Title

(c) Class

(d) By

(e) Output

2. More Statistics and Plots - Univariate

3. Proc Sort

This covers sections 2.A-H. You should also read section 19I.


Creating a SAS data set: Example

/* Population, population density, births and deaths for
   Western European countries, 1995 */
DATA EUROPE_W; /* this creates a SAS data set called EUROPE_W */
/* Source: Organisation for Economic Co-operation and Development,
   Labour Force Statistics, 1976-1996, Paris, 1997 Ed. */
LENGTH COUNTRY $ 11; /* the default length of 8 would truncate
                        names such as Netherlands */
INPUT COUNTRY $ POP DENSITY BRATE DRATE;
/* POP = population in 1000's, DENSITY = residents/km^2,
   BRATE, DRATE = birth, death rate per 1000 */
DATALINES;
Austria 8047 95.9 . .
Belgium 10137 332.4 . .
Denmark 5228 121.3 13.4 12.0
Finland 5108 15.1 12.3 9.6
France 58143 105.9 12.5 9.1

Germany 81661 228.8 9.4 10.8
Greece 10454 79.2 9.7 9.6
Iceland 267 2.61 7.2 6.0
Ireland 3598 51.2 . .
Italy 57283 190.2 . .
Luxembourg 413 158.8 13.2 9.3
Netherlands 15459 378.91 2.3 8.8
Norway 4348 13.4 13.8 10.3
Portugal 9918 107.3 10.8 10.5
Spain 39210 77.7 9.2 8.7
Sweden 8847 19.7 11.6 10.6
Switzerland 7062 171.0 11.6 8.9
UK 58606 239.4 12.5 11.0
;


Questions of interest:

1. How many missing birth rates are in our sample?

2. What is the mean population density?

3. How variable is population density from country to country?

4. What is the distribution of population? Of population density?


Another SAS Data Set: Infile and Input

• The file snails.txt contains data from an experiment in
  which groups of 20 snails were held for periods of 1, 2, 3
  or 4 weeks in carefully controlled conditions of temperature
  and relative humidity.

• There were two species of snail: A and B.

• At the end of the exposure time the snails were tested to
  see if they had survived; the process itself is fatal for
  the animals.

• Using the INFILE and INPUT statements, the data can be read
  into a SAS data set called SNAILS.

Species  Time  Humidity  Temperature  Fatalities  N
A        1     60.0      10           0           20
A        1     60.0      15           0           20
...........................................
B        4     75.8      20           7           20


Questions of interest:

1. What are the mean and standard deviation of the number of
   fatalities of species B for each level of exposure (TIME)?

2. What is the distribution of the number of fatalities?

3. What is an approximate 95% confidence interval for the
   mean number of fatalities?

4. How many times did 0 fatalities occur?


Proc Means

• Syntax:

PROC MEANS DATA = SASdata options;
(optional statements)

• Explanation:

– The DATA= option specifies a SAS data set. If this option
  is not used, SAS uses the most recently created SAS data
  set.

– Examples:

PROC MEANS DATA = EUROPE_W;
PROC MEANS DATA = SNAILS;
PROC MEANS;


Optional Statements for Proc Means

• To compute specific statistics, list their keywords as
  options on the PROC MEANS statement, e.g. N, NMISS, MEAN,
  STD, STDERR, CLM, MIN, MAX, SUM, VAR, CV, SKEWNESS,
  KURTOSIS and T. The MAXDEC=n option sets the number of
  decimal places printed.

• An additional option is NOPRINT, which suppresses printing
  of output in the Output window.

• Example:

PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR MAXDEC=4;

  gives the number of missing values for each numeric
  variable in the SAS data set EUROPE_W, as well as the mean,
  standard deviation and variance. The MAXDEC=4 option limits
  the printed results to 4 decimal places.

• A number of optional statements can also be used, including
  the TITLE, VAR, CLASS, BY and OUTPUT statements.
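
As a rough illustration of how these statistic keywords combine, the sketch below requests a count, the number of missing values, the mean, its standard error and 95% confidence limits for the mean. So that it runs on its own, it re-enters a few rows of the EUROPE_W data under the name EUROPE_DEMO, which is a name chosen for this sketch only. ALPHA=0.05 is written out for clarity, although CLM uses a 95% level by default.

```sas
/* Subset of the EUROPE_W rows shown earlier; EUROPE_DEMO is a
   name made up for this sketch */
DATA EUROPE_DEMO;
   LENGTH COUNTRY $ 11;
   INPUT COUNTRY $ POP DENSITY BRATE DRATE;
   DATALINES;
Austria 8047 95.9 . .
Denmark 5228 121.3 13.4 12.0
Finland 5108 15.1 12.3 9.6
France 58143 105.9 12.5 9.1
Germany 81661 228.8 9.4 10.8
;
RUN;

/* N and NMISS count non-missing and missing values; CLM gives
   confidence limits for the mean at level 1 - ALPHA */
PROC MEANS DATA=EUROPE_DEMO N NMISS MEAN STDERR CLM ALPHA=0.05 MAXDEC=2;
   VAR DENSITY BRATE;
RUN;
```

The confidence limits are based on the t distribution, so with a handful of rows they will be wide; the point here is only the form of the request.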


Subcommand statements for Proc Means

• The TITLE statement is useful for preparing reports.

• The VAR statement specifies the variables for which the
  summary statistics should be computed.

Example:

PROC MEANS DATA = EUROPE_W NMISS MEAN STD VAR;
TITLE 'Demographic Statistics for Western Europe';
VAR DENSITY BRATE DRATE;
RUN;


Subgrouping with the Class Statement

• The CLASS statement is used when we require the summary
  statistics to be computed separately for different
  subgroups or classes. For example, to estimate the mean
  number of fatalities for each of the two species of snail,
  we use SPECIES as a class variable.

• Example:

DATA SNAILS;
INFILE 'snails.txt';
INPUT SPECIES $ TIME HUMIDITY TEMP FATALITY N;
RUN;

PROC MEANS DATA=SNAILS MEAN;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES;
RUN;
QUIT;


Subgrouping with Class

• After execution, the Output window contains the two
  averages:

Mean Fatalities For Each Species of Snail

A  0.708333
B  4.020833

• We are actually interested in the mean number of
  fatalities for each type of snail at each level of exposure
  (TIME). Thus, TIME is a second classification variable,
  used in combination with the first classification variable
  SPECIES.

• We can obtain all of the required averages, as well as 95%
  confidence limits for the true mean in each case, by
  employing the following:

PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
RUN;


Subgrouping with the By Statement

• The BY statement is almost interchangeable with the CLASS
  statement. However, it will only work when the data set is
  sorted according to the BY variables. The CLASS statement
  does not have this restriction.

• Example:

PROC MEANS DATA=SNAILS MEAN CLM;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
BY SPECIES TIME;
RUN;

• This works because the data set is already sorted by
  SPECIES and, within each value of SPECIES, by TIME. The
  CLASS statement uses more memory than BY, but BY tends to
  be slower overall than CLASS when the data must first be
  sorted, since sorting is a slow operation. These
  differences are only noticeable for large data sets.
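
When the data are not already in BY order, one way to proceed is to sort first and then use BY. The following is a minimal sketch under stated assumptions: SNAILDEMO and its values are invented for illustration (they are not the snails.txt data) and are deliberately entered out of order.

```sas
/* Hypothetical toy data in a SNAILS-like layout; the values
   are made up and entered unsorted on purpose */
DATA SNAILDEMO;
   INPUT SPECIES $ TIME FATALITY;
   DATALINES;
B 2 3
A 1 0
B 1 1
A 2 1
A 1 0
B 2 5
A 2 2
B 1 2
;
RUN;

/* BY-group processing needs sorted input, so sort first */
PROC SORT DATA=SNAILDEMO;
   BY SPECIES TIME;
RUN;

/* One set of statistics per SPECIES-by-TIME group */
PROC MEANS DATA=SNAILDEMO MEAN CLM;
   VAR FATALITY;
   BY SPECIES TIME;
RUN;
```

Replacing the BY statement with CLASS SPECIES TIME would give the same group means without the PROC SORT step, at the cost of the extra memory noted above.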


Using Output from Proc Means

• The OUTPUT statement is used to create a new SAS data set
  consisting of the summary statistics computed by PROC
  MEANS.

• Example 1: The following creates a new SAS data set called
  SNAILSUM containing the 3 variables M_FATAL, S_FATAL and
  V_FATAL (along with SPECIES and the automatic variables
  _TYPE_ and _FREQ_). Because a CLASS statement is used,
  SNAILSUM has 3 observations: one for all the data combined
  (_TYPE_ = 0) and one for each species (_TYPE_ = 1).

PROC MEANS DATA=SNAILS MEAN STD VAR NOPRINT;
VAR FATALITY;
CLASS SPECIES;
OUTPUT OUT=SNAILSUM
       MEAN=M_FATAL
       STD =S_FATAL
       VAR =V_FATAL;
RUN;
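
If only the per-class rows are wanted in the output data set, the NWAY option on the PROC MEANS statement can be used. The sketch below is illustrative only: SNAILDEMO, DEMOSUM and the data values are hypothetical, and PROC PRINT is used simply to display the result, including the automatic variables _TYPE_ and _FREQ_.

```sas
/* Hypothetical toy data; values invented for illustration */
DATA SNAILDEMO;
   INPUT SPECIES $ TIME FATALITY;
   DATALINES;
A 1 0
A 1 0
A 2 1
A 2 2
B 1 1
B 1 2
B 2 3
B 2 5
;
RUN;

/* NWAY keeps only the rows for the full CLASS combination,
   so the overall (_TYPE_ = 0) row is not written */
PROC MEANS DATA=SNAILDEMO NWAY NOPRINT;
   CLASS SPECIES;
   VAR FATALITY;
   OUTPUT OUT=DEMOSUM
          MEAN=M_FATAL
          STD =S_FATAL;
RUN;

/* Show the summary data set: one observation per species */
PROC PRINT DATA=DEMOSUM;
RUN;
```

Running the same step without NWAY and printing DEMOSUM again is a quick way to see the extra overall row.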


Output: Another Example

• The following creates a SAS data set consisting of a single
  observation on the two variables M_BRATE and M_DRATE (plus
  the automatic variables _TYPE_ and _FREQ_). For each
  statistic in the OUTPUT statement, the new names are
  matched in order to the variables in the VAR statement, so
  supply one name per VAR variable.

PROC MEANS DATA=EUROPE_W MEAN;
VAR BRATE DRATE;
OUTPUT OUT=EUROPSUM
       MEAN=M_BRATE M_DRATE;
RUN;

• These new SAS data sets can later be used by other SAS
  procedures, if desired.


Proc Means: Example

• Here we plot a histogram of the average numbers of
  fatalities. Note that we have used the NOPRINT option here
  to suppress output to the Output window.

PROC MEANS DATA=SNAILS MEAN NOPRINT;
TITLE 'Mean Fatalities For Each Species of Snail';
VAR FATALITY;
CLASS SPECIES TIME;
OUTPUT OUT = SNAILSUM
       MEAN = M_FATAL;
RUN;

PROC CHART DATA=SNAILSUM;
VBAR M_FATAL;
RUN;


PROC UNIVARIATE

• Syntax:

PROC UNIVARIATE DATA = SASdata options;
statements;

• Many of the options are the same as for PROC MEANS. Some
  additional ones are available: see page 27 of the textbook.
  The default output is quite extensive and includes the
  median and quartiles, the extreme percentiles, and the
  lowest and highest 5 observations. These last are useful
  for checking that the data have been read in sensibly.

• The PLOT option gives a crude normal QQ (normal
  probability) plot; the NORMAL option adds formal tests of
  normality.

– The plot is an informal, yet useful, check of normality.

– It is a plot of the ordered observations versus the
  expected values of ordered normal observations.

– If the plot is close to a straight line, then the data are
  approximately normally distributed. Otherwise, the data are
  likely non-normal.
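
To make the definition of the plot concrete, the coordinates can also be built by hand. The sketch below is one possible approach, not the method PROC UNIVARIATE itself uses: it draws a hypothetical sample, uses PROC RANK with NORMAL=BLOM to attach an approximate expected normal order statistic to each observation, and plots one against the other with PROC PLOT. The data set names QQDEMO and QQSCORES, the variable NSCORE and the seed are all arbitrary choices for this illustration.

```sas
/* Hypothetical sample: 20 standard normal draws; the seed
   12345 is arbitrary */
DATA QQDEMO;
   CALL STREAMINIT(12345);
   DO I = 1 TO 20;
      X = RAND('NORMAL');
      OUTPUT;
   END;
RUN;

/* NORMAL=BLOM stores in NSCORE an approximate expected normal
   order statistic for the rank of each X */
PROC RANK DATA=QQDEMO OUT=QQSCORES NORMAL=BLOM;
   VAR X;
   RANKS NSCORE;
RUN;

/* Observations (vertical axis) against normal scores
   (horizontal axis); roughly linear if X is normal */
PROC PLOT DATA=QQSCORES;
   PLOT X*NSCORE;
RUN;
QUIT;
```

Because the points are plotted as pairs, no explicit sorting is needed: the smallest observation is automatically paired with the smallest normal score, and so on.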


Normal QQ Plot: Example

• This checks whether the distribution of Western European
  population densities is approximately normal (PLOT draws
  the normal probability plot; NORMAL adds tests of
  normality).

PROC UNIVARIATE DATA=EUROPE_W NORMAL PLOT;
VAR DENSITY;
RUN;

• To train your eye to recognize typical departures from
  normality, it helps to simulate normal and non-normal data
  sets of various sample sizes:

DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANNOR(0); /* standard normal draw, seeded from the clock */
PUT X;
END;
RUN; QUIT;


Normal QQ Plotting

• Now, construct the normal QQ plot:

DATA NORTEST;
INFILE 'normal.dat';
INPUT X;
RUN;

PROC UNIVARIATE DATA=NORTEST NORMAL PLOT;
VAR X;
RUN; QUIT;

• Repeating this for a number of different simulation runs
  will give you a good notion of what the normal QQ plot
  should look like.
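
One way to repeat the simulation several times in a single run, without writing an external file, is to generate all replicates in one DATA step and process them with a BY statement. This is a sketch under assumptions: the data set name NORSIM, the replicate index REP, the choice of 4 replicates and the seed are all arbitrary, and RAND('NORMAL') is used here in place of RANNOR.

```sas
/* 4 replicate samples of size 20 from a standard normal
   distribution; REP identifies the replicate */
DATA NORSIM;
   CALL STREAMINIT(2018);   /* arbitrary seed, for repeatability */
   DO REP = 1 TO 4;
      DO I = 1 TO 20;
         X = RAND('NORMAL');
         OUTPUT;
      END;
   END;
RUN;

/* The rows are generated in REP order, so no sort is needed
   before BY; one set of plots and tests per replicate */
PROC UNIVARIATE DATA=NORSIM NORMAL PLOT;
   VAR X;
   BY REP;
RUN;
```

Changing the inner loop limit from 20 to, say, 10 or 100 shows how much the plots vary with sample size.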


Normal QQ Plotting of Non-Normal Data

• To see what a normal QQ plot shouldn't look like, try
  something like the following:

DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
U = UNIFORM(0);
/* mixture: about 80% standard normal draws, 20% draws
   with standard deviation 5 (heavy tails) */
IF U < .8 THEN X = RANNOR(0);
ELSE X = 5*RANNOR(0);
PUT X;
END;
RUN; QUIT;

or


Normal QQ Plots of Non-Normal Data

DATA _NULL_;
FILE 'normal.dat';
N = 20;
DO I=1 TO N;
X = RANEXP(0); /* exponential draw: right-skewed */
PUT X;
END;
RUN; QUIT;

• In each case, create the normal QQ plot to see what happens
  when the data are really not normally distributed.


The Plot options and Proc Means

• Crude stem-and-leaf plots and boxplots can be produced
  using the PLOT option of PROC UNIVARIATE.

• Most of the statements that can be used with PROC MEANS can
  be used with PROC UNIVARIATE. In older releases of SAS the
  exception is the CLASS statement (more recent releases of
  PROC UNIVARIATE do accept CLASS). Where CLASS is not
  available, you must make sure the data are sorted properly
  and use the BY statement instead.


PROC SORT

• Syntax:

PROC SORT DATA=SASdata;
BY var1 var2 ... ;

Example 1:

PROC SORT DATA = EUROPE_W;
BY DENSITY;

The SAS data set then becomes

Country      POP    DENSITY  BRATE  DRATE
Iceland      267      2.61    7.2    6.0
Norway       4348    13.40   13.8   10.3
Finland      5108    15.10   12.3    9.6
................................
Netherlands  15459  378.91    2.3    8.8

The following sorts the data set so that DENSITY appears in
reverse (descending) order.

PROC SORT DATA = EUROPE_W;
BY DESCENDING DENSITY;
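
PROC SORT overwrites the input data set unless an OUT= data set is named. The sketch below shows OUT= together with a two-key sort in which only the first key is descending. EUROPE_DEMO and EUROPE_SORTED are names made up for this illustration; the rows are a subset of the EUROPE_W data shown earlier.

```sas
/* Subset of the EUROPE_W rows, re-entered so the sketch runs
   on its own */
DATA EUROPE_DEMO;
   LENGTH COUNTRY $ 11;
   INPUT COUNTRY $ POP DENSITY BRATE DRATE;
   DATALINES;
Austria 8047 95.9 . .
Belgium 10137 332.4 . .
Denmark 5228 121.3 13.4 12.0
Finland 5108 15.1 12.3 9.6
France 58143 105.9 12.5 9.1
UK 58606 239.4 12.5 11.0
;
RUN;

/* DESCENDING applies only to the variable that follows it:
   BRATE is sorted high to low, ties are broken by ascending
   DENSITY. OUT= leaves EUROPE_DEMO itself unchanged. */
PROC SORT DATA=EUROPE_DEMO OUT=EUROPE_SORTED;
   BY DESCENDING BRATE DENSITY;
RUN;

PROC PRINT DATA=EUROPE_SORTED;
RUN;
```

SAS treats missing numeric values as smaller than any number, so in the descending sort the countries with a missing BRATE appear last.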
