stat 342 - wk 8: continuous data - sfu.cajackd/stat342/lect_wk08.pdf · 2016. 12. 15. · stat 342...


Stat 342 - Wk 8: Continuous Data

proc iml

Loading and saving to datasets

proc means

proc univariate

proc sgplot

proc corr

Stat 342 Notes. Week 3, Page 1 / 71


PROC IML - Reading other datasets.

If you want to use an existing dataset as a matrix, rather than setting your own manually, you can do this with the 'use' and 'close' commands, combined with the 'read' command.

'Use' tells the system to bring a certain dataset into memory. 'Close' wipes it from memory (useful for keeping things running smoothly).

'Read' takes that dataset in memory and saves it into a matrix.


The general syntax is

use <DATASET>;

read all var <VARIABLE NAMES> into <MATRIX NAME>;

close <DATASET>;


Use (and close) use the same syntax for specifying a dataset as other SAS procedures. The dataset is specified by libname.dataset, and additional options like (obs = ) can be used to specify which rows to load into memory.

use ds1;

use work.ds1;

use somelib.cars;

use somelib.cars(OBS = 10);


Close is the same, except that the options are not necessary in the close statement.

close ds1;

close work.ds1;

close somelib.cars;

close somelib.cars;


With the read command, references to entire classes of variable names still work, including _ALL_ , _NUM_ , and _CHAR_ for all variables, numeric ones, and character ones, respectively.

proc iml;

use SASHELP.Cars(OBS=5);

read all var _NUM_ into m1;

read all var _NUM_ into m2[colname=NumericNames];

close SASHELP.Cars;

print m1, m2;
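
The read statement can also take an explicit list of variable names instead of a keyword like _NUM_. A minimal sketch, assuming the SASHELP.Class sample dataset that ships with SAS (it is not used elsewhere in these notes); the matrix names hw, varNames and firstRows are arbitrary:

```sas
proc iml;
  /* open the sample dataset for reading */
  use SASHELP.Class;
  /* read two named numeric columns into a matrix and keep their names */
  read all var {"Height" "Weight"} into hw[colname=varNames];
  close SASHELP.Class;
  /* show the first five rows with labelled columns */
  firstRows = hw[1:5, ];
  print firstRows[colname=varNames];
quit;
```

Naming the variables keeps ID-like numeric columns out of the matrix, which _NUM_ would otherwise pull in.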


Matrices are saved as SAS datasets with the create command and the append command.

'Create' makes a new SAS data set of whatever library and name you specify. However, the create command alone only makes a blank dataset (with whatever formatting you specify).

'Append' takes data from some matrix in the IML environment and adds (appends) it to a particular dataset, such as the one you just made with 'create'.


The general syntax is...

create <DATASET> from <MATRIX>;

append from <MATRIX>;

close <DATASET>;

As with loading, it's a good idea to close your datasets once you're done with them.


Example of saving data, and viewing it with proc print:

proc iml;

m = {1 2 3 . , 5 6 7 999};

m = t(m);

create newDS from m[colname={"x","y"}];

append from m;

close work.newDS;

quit;

proc print data=newDS; run;
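
Another way to write data out, sketched here with made-up vectors x and y and a hypothetical output dataset work.newDS2, is to name the vectors directly in the create statement with a var clause. Each vector then becomes one column of the dataset.

```sas
proc iml;
  x = {1, 2, 3, .};        /* column vector with one missing value */
  y = {5, 6, 7, 999};
  /* the columns take the names of the vectors listed here */
  create work.newDS2 var {"x" "y"};
  append;                  /* writes the current values of x and y */
  close work.newDS2;
quit;

proc print data=work.newDS2; run;
```

With this form there is no need to transpose a matrix or supply colname= by hand.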


Readings on proc iml

Textbook source: Pages 62-68 of 'SAS and R'.

Additional source for interest:

SAS Whitepaper 144-2013 - Getting Started with the SAS/IML Language, by Rick Wicklin

https://support.sas.com/resources/papers/proceedings13/144-2013.pdf


I hope IML isn't too fuzzy from two weeks ago


Proc means and proc univariate are used to describe individual variables on their own. Their capabilities are very similar, but...

1. They have different defaults and slightly different syntaxes.

2. Proc univariate can create histograms and boxplots.


Proc summary also operates similarly to proc means and proc univariate, but since...

3. Proc summary doesn't print output to the screen by default (it writes only to another dataset unless you add the print option)...

... and its syntax is less intuitive, we won't be covering it.
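
To see the difference in defaults side by side, here is a small sketch that runs both procedures on the same variable. It assumes the SASHELP.Class sample dataset that ships with SAS, which is not part of these notes.

```sas
/* proc means: one compact table with n, mean, std, min and max by default */
proc means data=sashelp.class;
  var Height;
run;

/* proc univariate: moments, quantiles and extreme observations by default,
   plus a histogram because we ask for one */
proc univariate data=sashelp.class;
  var Height;
  histogram Height;
run;
```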


The general syntax for proc means is...

proc means data=input_ds <table options>

<statistics of interest>;

<class/by varnames>;

var <numeric variables>;

output out=<output_ds> <statistic = varnames>;

run;


The complete list for <table options> is found here:

http://support.sas.com/documentation/cdl/en/proc/61895/HTML/default/viewer.htm#a000146729.htm

Notables include:

missing – When summarizing by groups using a categorical variable, this makes 'missing data' count as a group (shown in document camera example).


Alpha= <0.05> – The significance level, used for confidence interval statistics. The default value is 0.05, which will provide 95% confidence intervals.

Maxdec= <2> – The maximum number of digits shown after the decimal point. 0 means everything is rounded to the nearest whole number, 1 means everything is rounded to the nearest 0.1, and so on.

Print / noprint – Produce or ignore on-screen output.


Example 1: What does this do?

proc means data=input_ds alpha=0.10 maxdec=3 missing noprint;

...;

run;
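
For something you can actually run, here is a sketch of the same options on a made-up toy dataset (the names toy, grp and score are invented for illustration). Noprint is left off so the table appears on screen, and two rows have a missing group so the effect of 'missing' is visible.

```sas
data toy;
  input grp $ score;   /* a lone period is read as a missing value */
  datalines;
a 10.25
a 12.75
b 9.5
. 11
. 14
;
run;

/* 90% confidence limits, 3 decimals, missing grp counted as its own group */
proc means data=toy alpha=0.10 maxdec=3 missing mean clm;
  class grp;
  var score;
run;
```

Dropping the missing option should make the two rows without a group disappear from the table entirely.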


<statistics of interest> is where we list the statistics we wish to use to describe the variables.

proc means data=input_ds <table options>

<statistics of interest>;

var <numeric variables>;

<class/by varnames>;

output out=<output_ds> <statistic = varnames>;

run;


By default the stats computed are n mean std min max

n – The number of observations.

std – The standard deviation of the observations.

Other notables include sum stderr var and...

nmiss – The number of values missing

clm – Confidence limits of the mean. Affected by the alpha level determined earlier, but defaults to the 95% confidence bounds/limits.


Q1 Q3 – The first (lower) and third (upper) quartiles of the data, the points below which 25% and 75% of the values lie, respectively.

qrange – Interquartile range, the difference between Q3 and Q1. In other words, the width of the middle half of the data.

Skewness skew – Skewness, or skew, based on the third central moment. If this is positive, then you have more extreme data values above the mean than below. In other terms, the distribution stretches or 'skews' in the positive direction.

kurtosis – Based on the fourth central moment, used to describe how peak-like (positive kurtosis) or flat/plateau-like (negative kurtosis) a unimodal distribution is compared to the normal distribution.

P1 P5 P10 P90 P95 P99 – Percentiles. The points below which 1%, 5%, 10%, 90%, 95%, and 99% of the (observed) values for a given variable fall. See also quartiles, and the median.
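
As a sketch of how these keywords are used together, the statistics are simply listed on the proc means line. This assumes the SASHELP.Cars sample dataset that ships with SAS; the choice of variables is arbitrary.

```sas
/* request counts, centre, spread, shape and a few percentiles in one table */
proc means data=sashelp.cars
    n nmiss mean clm q1 median q3 qrange skew kurt p5 p95
    maxdec=2;
  var MPG_City Horsepower;
run;
```

Once any statistic is listed, only the listed ones are shown, so the defaults (n mean std min max) have to be named explicitly if you still want them.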


This is the complete list: [table of statistic keywords, shown as an image on the original slide]

var <numeric variables> defines the variables that will be summarized.

proc means data=input_ds <table options>

<statistics of interest>;

var <numeric variables>;

<class/by varnames>;

output out=<output_ds> <statistic = varnames>;

run;


Example 2: What does this do?

proc means data=input

clm std P1 P5 Q1 median Q3 P95 P99;

...;

run;


Now you can be the meanest of the mean.


var <numeric variables> defines the variables for which summary statistics are required.

By default, EVERY numeric variable is included in the summary, even ones that wouldn't make sense to take statistics of, like observation numbers, IDs, and phone numbers. This is just like if you had specified _numeric_ in the list of variables.


The batting dataset, found in your emails and on the website of the author of SAS and R, includes year-by-year records of many different Major League Baseball players.

It's more than 90,000 rows long, and includes thousands of players. We are interested in finding and confirming year-to-year patterns like 'have strike-outs become more common in recent years'.

A dataset like this is too large to feasibly print to screen. SAS University Edition will only show 100 rows at a time.


However, we CAN use a procedure like proc means to effectively characterize the data.

(Apologies to everyone that doesn't care about baseball, but its discrete events make it the easiest to analyze of popular sports in North America)

(Go Cubs!)


Loading the batting dataset

proc import

datafile='dir_location\baseball.csv'

out=batting dbms=csv;

delimiter=',';

getnames=yes;

run;


In the batting dataset, the variables that make the most sense to summarize are games batted (G_batting), and the number of at-bats (AB), runs (R), hits (H), home-runs (HR), walks (BB), intentional walks (IBB), and strike outs (SO).

Therefore, the variable specification line would look like this:

var G_batting AB R H HR BB IBB SO;
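
Dropped into a full procedure call, that line might be used as in the sketch below. It assumes the imported data is available as a dataset named batting (the name the later slides use); the statistics chosen are only an illustration.

```sas
/* summarize only the batting counts, not IDs or years */
proc means data=batting n nmiss mean sum maxdec=1;
  var G_batting AB R H HR BB IBB SO;
run;
```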


Output <out_ds> defines the dataset that the output will be saved to (if any).

<statistic = varnames> defines the variable names for the statistics that you produce.

proc means data=input_ds <table options>

<statistics of interest>;

<class/by varnames>;

var <numeric variables>;

output out=<out_ds> <statistic = varnames>;

run;


The format of the statistics and varnames is quite rigid. These work as expected, producing the sum of AB and BB for the relevant set of observations:

output out = out_ds

sum(AB) = Atbats sum(BB) = Walks;

However, this causes a syntax error:

output out = out_ds

sum(BB, IBB) = Walks2;
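
The output statement computes a statistic of one variable at a time; it won't add two variables together for you. One possible workaround, sketched here with the hypothetical names batting_walks and all_walks, is to build the combined variable in a data step first and then summarize it. The dataset name batting is assumed from the later slides.

```sas
/* step 1: make one variable holding walks plus intentional walks */
data batting_walks;
  set batting;
  all_walks = sum(BB, IBB);   /* sum() ignores missing values */
run;

/* step 2: summarize the new variable like any other */
proc means data=batting_walks noprint;
  var AB all_walks;
  output out=out_ds
    sum(AB) = Atbats sum(all_walks) = Walks2;
run;
```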


Finally, <class/by varnames> is what you use to specify how you wish to group each set of observations.

Without it, you will get the specified statistics of every observation put together.

proc means data=input_ds <table options>

<statistics of interest>;

<class/by varnames>;

var <numeric variables>;

output out=<out_ds> <statistic = varnames>;

run;


This code would give you the summary stats of (AB, SO) for all the data combined.

You'll get the total at-bats and strikeouts of ALL observations in the data set, and your output will be a single row.

proc means data=batting

sum;

var AB SO;

run;


This code, however, will give you the total number of at-bats and strikeouts for EACH year.

proc means data=batting

sum;

var AB SO;

class yearID;

run;
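
To follow up on the 'have strike-outs become more common' question, the yearly totals can be saved and turned into a rate. This is only a sketch: year_totals, AB_total, SO_total and so_rate are made-up names, and the nway option is used so that only the per-year rows are kept.

```sas
/* one output row per year, with no overall total row */
proc means data=batting noprint nway;
  class yearID;
  var AB SO;
  output out=year_totals sum=AB_total SO_total;
run;

/* strikeouts per at-bat for each year */
data year_totals;
  set year_totals;
  if AB_total > 0 then so_rate = SO_total / AB_total;
run;

proc print data=year_totals(obs=10); run;
```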


Classes are sorted in ascending alphabetical / numeric order. Proc means also gives the number of observations for each class.

[Output shown on the original slide for two cases: no class variable, and class yearID.]


More than one variable can be used as a class variable. If this is the case, observations will be grouped by each unique combination of the class variables, and aggregated within that group.

Class (and its pre-sorted version, 'by') may take a VERY long time to do this for a large dataset. It is HIGHLY recommended to test multiple class variables on some smaller 'toy' dataset first to avoid locking up your SAS terminal for hours.
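
Since 'by' expects the data to be sorted already, a 'by' version needs a proc sort first. A minimal sketch, assuming the batting dataset and its yearID and lgid variables; batting_sorted is a made-up name so the original dataset is left untouched.

```sas
/* 'by' needs the rows in order of the grouping variables */
proc sort data=batting out=batting_sorted;
  by yearID lgid;
run;

/* one table per year/league combination */
proc means data=batting_sorted sum;
  by yearID lgid;
  var AB SO;
run;
```

Unlike class, which prints one combined table, by produces a separate table for each group.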


class yearID lgid;

A variable can be one of the variables summarized AND be a class variable at the same time. The results may be confusing though (or obvious in the case of mean/median).

proc means data=batting

sum mean;

var yearID AB SO;

class yearID lgid;

run;


The output will look like this as a table.


...and if you use the default dataset output,

output out=ds_name;

... it looks like this.
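
To inspect the default output dataset yourself, here is a small sketch using the SASHELP.Class sample dataset that ships with SAS (an assumption; it is not the batting data).

```sas
/* no statistics named on the output statement, so the default set is saved */
proc means data=sashelp.class noprint;
  class Sex;
  var Height;
  output out=ds_name;
run;

proc print data=ds_name; run;
```

The printed dataset has one row per statistic (the _STAT_ column holds N, MIN, MAX, MEAN and STD) for each group, plus the automatic columns _TYPE_ (0 for the overall summary, 1 for the per-class rows here) and _FREQ_ (the number of observations in the group).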


Next in this series, hypothesis tests!


Before we continue, let's create some derived variables for the batting average, strike-out average, and walking average of each player per year.

Each of these variables is a proportion of the at-bats that a given player experiences in a given year.

data batting;

set batting;

batavg = H / AB;

so_avg = SO / AB;

walk_avg = sum(BB, IBB) / AB;

run;

Also, let's limit our dataset to years and players in which at least 300 at-bats were observed. This way our derived variables are proportions of a large number of observations.

data batting_300ab;

set batting;

if AB ge 300 then output;

run;


Consider the distributions of these three derived variables. Intuitively, they should be normally distributed. Is that the case?

Let's check the histograms with the UNIVARIATE procedure.

Including the 'histogram' statement in 'proc univariate' tells SAS to make histograms of the variables you specify.

If you don't specify any variables, it uses every variable that was listed in the 'var' statement.

Including '/normal' at the end of the histogram statement adds a fitted normal curve to the histograms created.


If you leave out the 'var' statement, every numeric variable in the dataset is included.

So this code will produce summary statistics and histograms for the three derived variables we made earlier, with fitted normal curves included:

proc univariate data=batting_300ab;

var batavg so_avg walk_avg;

histogram / normal;

run;


We can also check the quantile-quantile (vs normal distribution) plots for these variables, and see how well they fit.

In a quantile-quantile plot, the observations of a variable are sorted and compared to some theoretical quantiles.

For example: Say we have 1000 observations, and we want to compare the distribution of these observations to a normal distribution.

The 0.025 quantile of the normal distribution is 1.96 standard deviations below the mean. If our data is normal, then the 25th lowest data point should also be about 1.96 sd below the mean.
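
The arithmetic in that example can be checked with the quantile function in a data step. This is a simplified sketch: it uses rank/n as the probability, whereas plotting software typically uses a slightly adjusted position; the variable names are arbitrary.

```sas
/* theoretical standard normal quantiles for a few ranks out of n = 1000 */
data _null_;
  n = 1000;
  do rank = 25, 500, 975;
    p = rank / n;                 /* proportion of data at or below this rank */
    z = quantile('NORMAL', p);    /* matching z-score under normality */
    put rank= p= z= 8.3;          /* written to the log */
  end;
run;
```

The log should show z-scores of about -1.96, 0 and 1.96, which are the positions (in standard deviations from the mean) where the 25th, 500th and 975th sorted observations are expected to sit if the data are normal.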


Quantile-quantile plots can be produced with the 'qqplot' statement in the univariate procedure.

The '/ normal' specifies the distribution. Normal is the default. Other options include 'beta', 'exponential', 'gamma', and 'Weibull'.

proc univariate data=batting_300ab;

var batavg so_avg walk_avg;

qqplot / normal;

run;


If the selected distribution fits the data, we should see a straight line. (It doesn't have to be perfect at the extreme ends.)
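
Judging straightness is easier with a reference line drawn on the plot. One way to get one, sketched here for a single variable, is to give the normal option its mu= and sigma= parameters and let SAS estimate them from the data.

```sas
proc univariate data=batting_300ab;
  var batavg;
  /* est = use the sample mean and standard deviation for the reference line */
  qqplot batavg / normal(mu=est sigma=est);
run;
```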


Skewed distributions will demonstrate a single bend (vs normal).


More complicated deviations from normality will exhibit more complex behaviour, such as s-curves or extreme values.


Finally, we can conduct formal tests for normality by using the normal option right after we specify the input dataset.

proc univariate data=batting_300ab normal;

var batavg so_avg walk_avg;

run;

Some normality tests have minimum / maximum sample sizes. Only the appropriate tests will be performed. (The Shapiro-Wilk test, for example, won't show for N > 2000.)
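
Proc univariate prints a lot of tables, so if you only want the test results, an ods select statement in front of the procedure can filter the output. A minimal sketch; TestsForNormality is the ODS name of the table the normal option produces.

```sas
/* keep only the normality-test table from the next procedure */
ods select TestsForNormality;

proc univariate data=batting_300ab normal;
  var batavg so_avg walk_avg;
run;
```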


[Output shown on the original slide for: batting averages, strikeout averages, walking averages.]

Let's go exploring for bivariate trends.


For continuous data, we can look at two variables at a time in several ways.

- Correlation

- Scatterplots

- Regression

For scatterplots, we can use the SGPLOT procedure, which stands for Statistical Graphics PLOT.


Proc sgplot is the basic procedure for a wide variety of graphing options. These include:

Scatterplots with the scatter statement.

Bar plots with the hbar and vbar statements.

Box plots with the hbox and vbox statements.

Fitted regression lines with the reg statement.

Locally weighted (loess) smoothed fits with the loess statement.

Density curves (normal by default, or kernel-based) with the density statement.
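
A couple of these statements in action, as a sketch on the SASHELP.Class sample dataset that ships with SAS (an assumption; any dataset with two numeric variables and a grouping variable would do):

```sas
/* scatterplot with a fitted regression line layered on top */
proc sgplot data=sashelp.class;
  scatter x=Height y=Weight;
  reg x=Height y=Weight / nomarkers;   /* nomarkers: don't redraw the points */
run;

/* side-by-side vertical box plots, one per category */
proc sgplot data=sashelp.class;
  vbox Height / category=Sex;
run;
```

Statements within one proc sgplot call are drawn on the same axes, in the order they are listed.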


For a scatterplot between strikeout average and batting (hitting) average, the code is:

proc sgplot data=batting_300ab;

scatter x=batavg y=so_avg;

run;


We can draw a 99% prediction ellipse around these data points with the 'ellipse' statement.

The default is a 95% level for confidence/prediction, including for the ellipse, so we decrease alpha in the settings in order to make an ellipse that's farther out and more visible.

proc sgplot data=batting_300ab;

scatter x=batavg y=so_avg;

ellipse x=batavg y=so_avg / alpha = 0.01;

run;


There are tons of data points (about 19,700), so it's hard to get a sense of the relative density of data points. We can fix that with some graphical options.

proc sgplot data=batting_300ab;

scatter x=batavg y=so_avg /transparency = 0.9;

ellipse x=batavg y=so_avg / alpha = 0.01;

run;


We can do something more formal with the correlation procedure, PROC CORR.

For example, we can get the simple pairwise (Pearson) correlation between all three of our derived variables with the following:

proc corr data=batting_300ab;

var batavg so_avg walk_avg;

run;


This will produce some simple summary stats, and the correlation coefficient, p-value against 0 correlation (two-tailed), and number of observation pairs for each pair of variables.
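
If you want the correlations as data rather than as a printed table, proc corr can also write them to a dataset with the outp= option. A sketch, where corr_out is a made-up dataset name:

```sas
/* save the Pearson correlations instead of printing them */
proc corr data=batting_300ab outp=corr_out noprint;
  var batavg so_avg walk_avg;
run;

proc print data=corr_out; run;
```

The saved dataset uses a _TYPE_ column to mark which rows hold means, standard deviations, counts and correlations, and a _NAME_ column to label the correlation rows.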


We can look much deeper into this with alternative correlation measures, like Spearman's rank correlation (ideal for non-linear, monotonic relationships), or Kendall's Tau.

proc corr data=batting_300ab

Pearson Spearman Kendall;

var batavg so_avg walk_avg;

run;


This will produce equivalent matrices for the other measures.

If you still want the Pearson correlation as well, you have to specify it.


We can also produce scatterplots and histograms of each variable and pair of variables in a neatly-arranged matrix.

For large datasets, you may need to get around the default limit of the number of data points to draw with the maxpoints option in the plots() settings.

proc corr data=batting_300ab

plots(maxpoints = 50000)=matrix(histogram);

var batavg so_avg walk_avg;

run;


Readings on week 8 material

Textbook source: Chapter 5 'SAS and R'.

Sources for your interest:

The Essential Meaning of PROC MEANS http://www2.sas.com/proceedings/sugi26/p064-26.pdf

Readings on week 9 material (next week)

Textbook source: Chapter 6 'SAS and R'.
