EPIB 698D Lecture 2
Raul Cruz Spring 2013
SAS functions
• SAS has over 400 functions, with the following general form:
Function-name(argument, argument, …)
• All functions must have parentheses, even if they do not require any arguments
• Example:
X = int(log(10));
Mean_score = mean(score1, score2, score3);
• The MEAN function returns the mean of the non-missing arguments. This differs from adding the arguments and dividing by their number, which returns a missing value if any argument is missing
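A minimal sketch of the difference between MEAN and "add and divide", using made-up scores. The data set name scores_demo and the values are illustrative assumptions, not part of the lecture data.

```sas
/* Hypothetical scores; score2 is deliberately missing */
data scores_demo;
  score1 = 80;
  score2 = .;
  score3 = 90;
  /* MEAN ignores the missing argument: (80 + 90) / 2 = 85 */
  mean_fn = mean(score1, score2, score3);
  /* Arithmetic with a missing operand gives a missing result */
  mean_manual = (score1 + score2 + score3) / 3;
run;

proc print data=scores_demo;
run;
```

The log should also carry a note that missing values were generated for mean_manual, which is a useful hint when a derived variable is unexpectedly empty.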
Common Functions And Operators
• Functions:
ABS: absolute value
EXP: exponential
LOG: natural logarithm
MAX and MIN: maximum and minimum
SQRT: square root
SUM: sum of variables
Example: SUM(of x1-x10, x21)
• Arithmetic operators: +, -, *, /, ** (exponentiation is written **, not ^)
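More SAS functions
Function   Example                                   Result
Max        Y = max(1, 3, 5);                         Y = 5
Round      Y = round(1.236, 0.01);                   Y = 1.24
Sum        Y = sum(1, 3, 5);                         Y = 9
Length     a = 'my cat'; Y = length(a);              Y = 6
Trim       a = 'my '; b = 'cat'; Y = trim(a)||b;     Y = 'mycat'
Note: the second argument of ROUND is the rounding unit, not the number of decimal places, so use 0.01 to round to two decimals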
Using IF-THEN statement
• The IF-THEN statement is used for conditional processing. Example: you want to compute mean test scores for female students but not for male students, so the calculation is conditioned on gender = 'F'
• Syntax: IF condition THEN action;
Eg: If gender = 'F' then mean_score = mean(scr1, scr2);
Using IF-THEN statement
List of logical comparison operators
Logical comparison          Mnemonic   Symbol
Equal to                    EQ         =
Not equal to                NE         ^= or ~=
Less than                   LT         <
Less than or equal to       LE         <=
Greater than                GT         >
Greater than or equal to    GE         >=
Equal to one in a list      IN
Note: In comparisons, a missing numeric value is treated as smaller than any non-missing number, so a condition such as Age < 20 is true when Age is missing
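A short sketch of the operators in the table above, showing that the mnemonic and symbol forms are interchangeable and how IN tests membership in a list. The data set name op_demo, the flag variables, and the two data rows are illustrative assumptions.

```sas
data op_demo;
  input Age Gender $;
  /* Mnemonic and symbol forms give the same result */
  if Age GE 40 then older_mnemonic = 1;
  if Age >= 40 then older_symbol = 1;
  /* IN is true when the value equals any item in the list */
  if Gender in ('F', 'M') then known_gender = 1;
  datalines;
21 M
48 F
;
run;

proc print data=op_demo;
run;
```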
Using IF-THEN statement
• Example: We have data containing the following information on subjects:
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
35   M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
15   M       88       C     93
• Task: Group students by age (<20, [20-40), [40-60), >=60)
data conditional;
input Age Gender $ Midterm Quiz $2. FinalExam;
datalines;
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
;
data new1;
set conditional;
if Age < 20 then AgeGroup = 1;
if 20 <= Age < 40 then AgeGroup = 2;
if 40 <= Age < 60 then AgeGroup = 3;
if Age >= 60 then AgeGroup = 4;
run;
Multiple conditions with AND and OR
• Syntax: IF condition1 AND condition2 THEN action;
• Eg:
If age < 40 and gender = 'F' then group = 1;
If age < 40 or gender = 'F' then group = 2;
IF-THEN statement, multiple conditions
• Example: We have data containing the following information on subjects:
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
35   M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
15   M       88       C     93
• Task: Group students by age (<40, >=40) and gender
data new1;
set conditional;
if age < 40 and gender = 'F' then group = 1;
if age >= 40 and gender = 'F' then group = 2;
if age < 40 and gender = 'M' then group = 3;
if age >= 40 and gender = 'M' then group = 4;
run;
• Note: In comparisons, a missing numeric value is treated as smaller than any non-missing number
• Example: assign age groups when some ages are missing
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
.    M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
.    M       88       C     93
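Because a missing Age compares as smaller than any number, a test such as Age < 20 would place the two subjects with missing ages in group 1. One possible guard is sketched below. It assumes that subjects with a missing age should keep a missing AgeGroup, which is a design choice and not something the slides state. The data set names with_missing and grouped are hypothetical.

```sas
/* The data with two missing ages, as listed above */
data with_missing;
  input Age Gender $ Midterm Quiz $2. FinalExam;
  datalines;
21 M 80 B- 82
20 F 90 A 93
. M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
. M 88 C 93
;
run;

data grouped;
  set with_missing;
  /* Handle missing first so it never reaches the Age < 20 test */
  if missing(Age) then AgeGroup = .;
  else if Age < 20 then AgeGroup = 1;
  else if Age < 40 then AgeGroup = 2;
  else if Age < 60 then AgeGroup = 3;
  else AgeGroup = 4;
run;

proc print data=grouped;
run;
```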
IF-THEN statement, with multiple actions
• Example: We have data containing the following information on subjects:
Age  Gender  Midterm  Quiz  FinalExam
21   M       80       B-    82
20   F       90       A     93
35   M       87       B+    85
48   F       80       C     76
59   F       95       A+    97
15   M       88       C     93
• Task: Group students by age, and assign a test date based on the age group
Multiple actions with DO, END
• Syntax:
IF condition THEN DO;
action1;
action2;
END;
• Eg:
If age <= 20 then do;
group = 1;
exam_date = "Monday";
end;
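The DO/END fragment above can be placed in a complete DATA step as sketched here. The ELSE branch, the value "Wednesday", the LENGTH statement, and the data set name scheduled are assumptions added for illustration. Only the Age <= 20 branch comes from the slide.

```sas
data conditional;
  input Age Gender $ Midterm Quiz $2. FinalExam;
  datalines;
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
;
run;

data scheduled;
  set conditional;
  /* Declare the length so longer day names are not truncated */
  length exam_date $ 9;
  if Age <= 20 then do;
    group = 1;
    exam_date = "Monday";
  end;
  else do;
    /* Hypothetical second group and test date */
    group = 2;
    exam_date = "Wednesday";
  end;
run;

proc print data=scheduled;
run;
```

Without the LENGTH statement, exam_date would take its length from the first assignment ("Monday", 6 characters) and later values would be cut off.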
IF-THEN/ELSE statement
• Syntax:
IF condition1 THEN action1;
ELSE IF condition2 THEN action2;
ELSE IF condition3 THEN action3;
• The IF-THEN/ELSE statement has two advantages over a series of IF-THEN statements:
(1) It is more efficient and uses less computing time, because once a condition is true the remaining ELSE conditions are skipped
(2) The ELSE logic ensures that your groups are mutually exclusive, so that you do not put one observation into more than one group
IF-THEN/ELSE statement
data new1;
set conditional;
if Age < 20 then AgeGroup = 1;
else if Age >= 20 and Age < 40 then AgeGroup = 2;
else if Age >= 40 and Age < 60 then AgeGroup = 3;
else if Age >= 60 then AgeGroup = 4;
run;
Subsetting your data
• You can subset your data using an IF statement in a DATA step
• Example (both versions keep only the observations with gender = 'F'):
Data new1;
Set new;
If gender = 'F';
run;
Data new1;
Set new;
If gender ^= 'F' then delete;
run;
Stacking data sets using the SET statement
• With more than one data set, the SET statement stacks the data sets one on top of the other
• Syntax:
DATA new-data-set;
SET data-set-1 data-set-2 … data-set-n;
• The number of observations in the new data set equals the sum of the numbers of observations in the old data sets
• The order of observations is determined by the order of the list of old data sets
• If one of the data sets has a variable not contained in the other data sets, then observations from the other data sets will have missing values for that variable
Stacking data sets using the SET statement
• Example: Here are data on visitors to a park. There are two entrances: a south entrance and a north entrance. The data file for the south entrance has an S for south, followed by the customers' pass numbers, the size of their parties, and ages. The data file for the north entrance has an N for north, the same variables as the south entrance, plus one more variable for the parking lot.
/* North.dat */
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
/* South.dat */
S 43 3 27
S 44 3 24
S 45 3 2
DATA southentrance;
INPUT Entrance $ PassNumber PartySize Age;
cards;
S 43 3 27
S 44 3 24
S 45 3 2
;
run;
DATA northentrance;
INPUT Entrance $ PassNumber PartySize Age Lot;
cards;
N 21 5 41 1
N 87 4 33 3
N 65 2 67 1
N 66 2 7 1
;
run;
DATA both;
SET southentrance northentrance;
RUN;
Combining data sets with one-to-many match
• One-to-many match: matching one observation from one data set with more than one observation in another data set
• The statements for a one-to-many match are the same as for a one-to-one match:
DATA new-data-set;
MERGE data-set-1 data-set-2;
BY variable-list;
• The data sets must first be sorted by the BY variables
• If the two data sets have variables with the same names, besides the BY variables, the variables from the second data set will overwrite any variables with the same name in the first data set
Example: Shoes data
• The shoe store is putting all its shoes on sale. They have two data files: one contains information about each type of shoe, and one contains discount information. We want to find the new price of each shoe
Shoe data:
Max Flight        running   142.99
Zip Fit Leather   walking    83.99
Zoom Airborne     running   112.99
Light Step        walking    73.99
Max Step Woven    walking    75.99
Zip Sneak         c-train    92.99
Discount data:
c-train   .25
running   .30
walking   .20
DATA regular;
INFILE datalines dsd;
length style $15;
INPUT Style $ ExerciseType $ RegularPrice @@;
datalines;
Max Flight , running, 142.99, …
;
PROC SORT DATA = regular;
BY ExerciseType;
DATA discount;
INPUT ExerciseType $ Adjustment @@;
cards;
c-train .25 …
;
DATA prices;
MERGE regular discount;
BY ExerciseType;
NewPrice = ROUND(RegularPrice - (RegularPrice * Adjustment), .01);
RUN;
Simplifying programs with Arrays
• SAS arrays are collections of elements (usually SAS variables) that let you write SAS statements referencing the whole group of variables
• Arrays are defined using the ARRAY statement:
ARRAY name (n) variable-list;
name: the name you give to the array
n: the number of variables in the array
• Eg: ARRAY store (4) macys sears target costco;
store(1) is the variable macys
store(2) is the variable sears
Simplifying programs with Arrays
• A radio station is conducting a survey asking people to rate 10 songs. The rating is on a scale of 1 to 5, with 1 = do not like the song and 5 = like the song
• If a listener does not want to rate a song, he or she enters a "9" to indicate a missing value
• Here are the data, with location, listener's age, and ratings for the 10 songs:
Albany    54  4 3 5 9 9 2 1 4 4 9
Richmond  33  5 2 4 3 9 2 9 3 3 3
Oakland   27  1 3 2 9 9 9 3 4 2 3
Richmond  41  4 3 5 5 5 2 9 4 5 5
Berkeley  18  3 4 9 1 4 9 3 9 3 2
• We want to change 9 to a missing value (.)
Simplifying programs with Arrays
DATA songs;
INFILE 'E:\radio.txt';
INPUT City $ 1-15 Age domk wj hwow simbh kt aomm libm tr filp ttr;
ARRAY song (10) domk wj hwow simbh kt aomm libm tr filp ttr;
DO i = 1 TO 10;
IF song(i) = 9 THEN song(i) = .;
END;
run;
Using shortcuts for lists of variable names
• When writing SAS programs, we often need to write a list of variable names. When a data set has many variables, a shortcut for lists of variable names is helpful
• Numbered range list: variables whose names start with the same characters and end with consecutive numbers can be written as a numbered range list
• Eg: INPUT cat8 cat9 cat10 cat11; can be written as INPUT cat8 - cat11;
Using shortcuts for lists of variable names
• Name range list: a name range list depends on the internal order, or position, of the variables in a SAS data set. This is determined by the order in which the variables appear in the DATA step
• Eg: Data new; Input x1 x2 y2 y3; Run;
• Then the internal order of the variables is: x1 x2 y2 y3
• The shortcut for this variable list is x1 -- y3 (two hyphens; a single hyphen denotes a numbered range list)
• PROC CONTENTS with the POSITION option can be used to find the internal order
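A small sketch of checking the internal order and then using a name range list. It uses the VARNUM option of PROC CONTENTS, which lists variables by their position in the data set. The POSITION keyword named on the slide is assumed to behave the same way. The single data row is made up.

```sas
data new;
  input x1 x2 y2 y3;
  datalines;
1 2 3 4
;
run;

/* VARNUM lists variables in creation order instead of alphabetically */
proc contents data=new varnum;
run;

/* Name range list: every variable positioned from x1 through y3 */
proc print data=new;
  var x1 -- y3;
run;
```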
Using shortcuts for lists of variable names
DATA songs;
INFILE 'E:\radio.txt';
INPUT City $ 1-15 Age domk wj hwow simbh kt aomm libm tr filp ttr;
ARRAY new (10) Song1 - Song10;
ARRAY old (10) domk -- ttr;
DO i = 1 TO 10;
IF old(i) = 9 THEN new(i) = .;
ELSE new(i) = old(i);
END;
AvgScore = MEAN(OF Song1 - Song10);
run;
Sorting, Printing and Summarizing Your Data
• SAS procedures (PROCs) perform a specific analysis or function and produce results or reports
• Eg: Proc Print data=new; run;
• All procedures have required statements, and most have optional statements
• All procedures start with the keyword PROC, followed by the name of the procedure, such as PRINT or CONTENTS
• Options, if there are any, follow the procedure name
• The DATA=data_name option tells SAS which data set to use as input for the procedure. NOTE: if you omit it, SAS uses the most recently created data set, which is not necessarily the same as the most recently used data set
BY statement
• The BY statement is required for only one procedure, PROC SORT
PROC Sort data=new;
By gender;
Run;
• For all other procedures, BY is an optional statement. It tells SAS to perform the analysis separately for each level of the BY variable, instead of treating all subjects as one group
Proc Print data=new;
By gender;
Run;
• All procedures except PROC SORT assume your data are already sorted by the variables in your BY statement
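The two steps above are normally run together: sort first, then use BY in the reporting procedure. A minimal sketch, re-creating the conditional data set from earlier in the lecture so that it runs on its own. The output data set name by_gender is an assumption.

```sas
data conditional;
  input Age Gender $ Midterm Quiz $2. FinalExam;
  datalines;
21 M 80 B- 82
20 F 90 A 93
35 M 87 B+ 85
48 F 80 C 76
59 F 95 A+ 97
15 M 88 C 93
;
run;

/* Sort into a new data set so the original order is kept */
proc sort data=conditional out=by_gender;
  by Gender;
run;

/* One listing per value of Gender; requires the sorted data */
proc print data=by_gender;
  by Gender;
run;
```

Running the PROC PRINT step on the unsorted conditional data set instead would stop with an error that the data are not sorted in ascending sequence.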
PROC Sort
• Syntax:
Proc Sort data=input_data_name out=out_data_name;
By variable-1 … variable-n;
• The variables in the BY statement are called BY variables
• With one BY variable, SAS sorts the data based on the values of that variable
• With more than one BY variable, SAS sorts observations by the first variable, then by the second variable within the categories of the first variable, and so on
• The DATA= and OUT= options specify the input and output data sets. Without the DATA= option, SAS uses the most recently created data set. Without the OUT= option, SAS replaces the original data set with the newly sorted version
PROC Sort
• By default, SAS sorts data in ascending order, from the lowest to the highest value or from A to Z. To reverse the order (highest to lowest, or Z to A), add the keyword DESCENDING before each variable that should be sorted that way
• The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same values for the BY variables
PROC Sort
• Example: The file sealife.txt contains information on the average length in feet of selected whales and sharks. We want to sort the data by family and length
Name      Family  Length
beluga    whale   15
whale     shark   40
basking   shark   30
gray      whale   50
mako      shark   12
sperm     whale   60
dwarf     shark   .5
whale     shark   40
humpback  .       50
blue      whale   100
killer    whale   30
PROC Sort
DATA marine;
INFILE 'E:\Sealife.txt';
INPUT Name $ Family $ Length;
run;
* Sort the data;
PROC SORT DATA = marine OUT = seasort NODUPKEY;
BY Family DESCENDING Length;
run;
Summarizing your data with PROC MEANS
• The MEANS procedure provides simple statistics on numeric variables. Syntax: PROC MEANS options;
• Simple statistics that PROC MEANS can produce:
MAX: the maximum value
MIN: the minimum value
MEAN: the mean
N: number of non-missing values
STDDEV: the standard deviation
NMISS: number of missing values
RANGE: the range of the data
SUM: the sum
MEDIAN: the median
• DEFAULT: if no statistics are requested, PROC MEANS prints N, MEAN, STDDEV, MIN, and MAX
Proc means
• Optional statements of PROC MEANS:
BY variable-list: performs the analysis for each level of the variables in the list. The data must be sorted first
CLASS variable-list: performs the analysis for each level of the variables in the list. The data do not need to be sorted
VAR variable-list: specifies which variables to use in the analysis
Proc means
• A wholesale nursery is selling garden flowers, and they want to summarize their sales figures by month. The data are as follows:
ID      Date        Lily  SnapDragon  Marigold
756-01  05/04/2001  120   80          110
756-01  05/14/2001  130   90          120
834-01  05/12/2001  90    160         60
834-01  05/14/2001  80    60          70
901-02  05/18/2001  50    100         75
834-01  06/01/2001  80    60          100
756-01  06/11/2001  100   160         75
901-02  06/19/2001  60    60          60
756-01  06/25/2001  85    110         100
DATA sales;
INFILE 'E:\Flowers.txt';
INPUT CustomerID $ @9 SaleDate MMDDYY10. Lily SnapDragon Marigold;
Month = MONTH(SaleDate);
PROC SORT DATA = sales;
BY Month;
* Calculate means by Month for flower sales;
PROC MEANS DATA = sales;
BY Month;
VAR Lily SnapDragon Marigold;
TITLE 'Summary of Flower Sales by Month';
RUN;
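For comparison, the same summary can be sketched with a CLASS statement, which does not need a prior PROC SORT. To keep the sketch self-contained it reads the rows listed above from in-stream data and uses the colon modifier (:mmddyy10.) instead of the @9 column pointer. Both are substitutions for illustration, not the lecture's program. The data set name sales_demo is hypothetical.

```sas
data sales_demo;
  /* Colon modifier: read the date with list input using MMDDYY10. */
  input CustomerID $ SaleDate :mmddyy10. Lily SnapDragon Marigold;
  Month = month(SaleDate);
  datalines;
756-01 05/04/2001 120 80 110
756-01 05/14/2001 130 90 120
834-01 05/12/2001 90 160 60
834-01 05/14/2001 80 60 70
901-02 05/18/2001 50 100 75
834-01 06/01/2001 80 60 100
756-01 06/11/2001 100 160 75
901-02 06/19/2001 60 60 60
756-01 06/25/2001 85 110 100
;
run;

/* CLASS groups by Month without requiring sorted input */
proc means data=sales_demo;
  class Month;
  var Lily SnapDragon Marigold;
  title 'Summary of Flower Sales by Month (CLASS version)';
run;
```

With CLASS the results appear in one table with a row block per month, whereas BY prints a separate table for each month.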
Proc GCHART for bar charts
• Example: A bar chart showing the distribution of blood types from the Blood data set
/* The blood.txt data contain information on 1000 subjects. The variables include: subject ID, gender, blood type, age group, red blood cell count, white blood cell count, and cholesterol. */
DATA blood;
INFILE 'C:\blood.txt';
INPUT ID Sex $ BloodType $ AgeGroup $ RBC WBC Cholesterol;
run;
title "Distribution of Blood Types";
proc gchart data=blood;
vbar BloodType;
run;
Proc GCHART for bar charts
• VBAR: requests a vertical bar chart for the variable
• Alternatives to VBAR are as follows:
HBAR: horizontal bar chart
VBAR3D: three-dimensional vertical bar chart
HBAR3D: three-dimensional horizontal bar chart
PIE: pie chart
PIE3D: three-dimensional pie chart
DONUT: donut chart
A Few Options
proc gchart data=blood;
vbar bloodtype / space=0 type=percent;
run;
• space=0: controls the spacing between bars
• type=percent: changes the statistic from frequency to percent
Type option
• Type=freq: displays frequencies of a categorical variable
• Type=pct (percent): displays percents of a categorical variable
• Type=cfreq: displays cumulative frequencies of a categorical variable
• Type=cpct (cpercent): displays cumulative percents of a categorical variable
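A brief sketch of one of the TYPE= values listed above, cumulative percent. The blood.txt file is not reproduced in these notes, so the example uses a tiny made-up data set. The name blood_demo and its six rows are assumptions.

```sas
/* Hypothetical blood types for six subjects */
data blood_demo;
  input ID BloodType $;
  datalines;
1 A
2 O
3 O
4 B
5 AB
6 A
;
run;

title "Cumulative Percent of Blood Types";
proc gchart data=blood_demo;
  /* Each bar shows the running percent up to that category */
  vbar BloodType / type=cpercent;
run;
quit;
```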
Basic Output
[Figure: default frequency bar chart of a continuous variable. The bar labeled 7,000 corresponds to a class ranging from 6500 to 7500, with a frequency of about 350.]
• SAS computes the midpoint of each bar automatically. You can change this by supplying your own midpoints:
vbar RBC / midpoints=4000 to 11000 by 1000;
Creating charts with values representing categories
• SAS places continuous variables into groups before generating a frequency bar chart
• If you want to treat the values as discrete categories, you can use the DISCRETE option
• Example: create a bar chart showing the frequencies of hospital visits by day of the week
libname d "C:\";
data day_of_week;
set d.hosp;
Day = weekday(AdmitDate);
run;
*Program Demonstrating the DISCRETE option of PROC GCHART;
title "Visits by Day of the Week";
proc gchart data=day_of_week;
vbar Day / discrete;
run;
The Discrete Option
proc gchart data=day_of_week;
vbar day / discrete;
run;
• DISCRETE establishes each distinct value of the midpoint variable as a midpoint on the graph. If the variable is formatted, the formatted values are used to construct the midpoints
• If you use DISCRETE with a numeric variable you should:
1. Be sure it has only a few distinct values, or
2. Use a format to make categories for it
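Point 2 can be sketched as follows: a user-defined format turns the numeric weekday (1 = Sunday through 7 = Saturday, as returned by WEEKDAY) into labeled categories. The format name dayfmt, the data set visits_demo, and the four admission dates are assumptions made so the example runs without the hosp data set.

```sas
/* Labels for the values returned by WEEKDAY (1 = Sunday) */
proc format;
  value dayfmt 1='Sun' 2='Mon' 3='Tue' 4='Wed'
               5='Thu' 6='Fri' 7='Sat';
run;

/* Hypothetical admission dates */
data visits_demo;
  input AdmitDate :mmddyy10.;
  Day = weekday(AdmitDate);
  datalines;
05/04/2001
05/14/2001
05/12/2001
05/14/2001
;
run;

title "Visits by Day of the Week";
proc gchart data=visits_demo;
  /* DISCRETE uses the formatted values as bar labels */
  vbar Day / discrete;
  format Day dayfmt.;
run;
quit;
```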
Summary Variables
• If you want the bar chart to summarize the values of an analysis variable for each midpoint, use the sumvar= option (together with type=)
• sumvar=variable-name
• Type=mean: displays the mean of a continuous variable
• Type=sum: displays the total of a continuous variable (this is the default when sumvar= is used)
Creating bar charts representing sums
• The GCHART procedure can be used to create bar charts where the height of each bar represents some statistic, such as the mean or sum, for each value of a classification variable
• Example: Bar chart showing the sum of TotalSales for each region of the country
title "Total Sales by Region";
proc gchart data=d.sales;
vbar Region / sumvar=TotalSales type=sum;
format TotalSales dollar8.;
run;
Creating bar charts representing means
* The blood data set above stores gender in the variable Sex;
proc gchart data=blood;
vbar Sex / sumvar=cholesterol type=mean;
run;
quit;
GPLOT
• The GPLOT procedure plots the values of two or more variables on a set of coordinate axes (X and Y)
• The procedure produces a variety of two-dimensional graphs, including
– simple scatter plots
– overlay plots, in which multiple sets of data points display on one set of axes
Procedure Syntax: PROC GPLOT
• PROC GPLOT;
PLOT y*x </option(s)>;
run;
• Example: plot of systolic blood pressure (SBP) by diastolic blood pressure (DBP)
title "Scatter Plot of SBP by DBP";
proc gplot data=d.clinic;
plot SBP * DBP;
run;
• Multiple plots can be made in 3 ways:
(1) proc gplot; plot y1*x y2*x / overlay; run;
plots y1 versus x and y2 versus x using the same horizontal and vertical axes.
(2) proc gplot; plot y1*x; plot2 y2*x; run;
plots y1 versus x and y2 versus x using different vertical axes. The second vertical axis appears on the right-hand side of the graph.
(3) proc gplot; plot y1*x=z; run;
uses z as a classification variable and produces a single graph plotting y1 against x for each value of z.
*controlling the axis ranges;
title "Scatter Plot of SBP by DBP";
proc gplot data=d.clinic;
plot SBP * DBP / haxis=70 to 120 by 5 vaxis=100 to 220 by 10;
run;