using sas procedures sas system options – sas system options are parameters you can change that...
TRANSCRIPT
USING SAS PROCEDURES• SAS System Options
– SAS System options are parameters you can change that affect the SAS System – how it works, what the output looks like, error handling, and so forth.
– You can set system options in the SAS System Options window, in an OPTIONS statement, or in a couple of other ways. In this class we will focus on the OPTIONS statement.
• OPTIONS Statement– An OPTIONS statement is submitted as part of a
SAS program and affects all steps that follow it.
OPTIONS Statement
• The OPTIONS statement is one of the special SAS statements that do not belong to either a PROC or a DATA step.
• It can appear anywhere in a program, but often the very first line in a program is an OPTIONS statement. That way you can easily see what options are in effect.
• Any subsequent OPTIONS statements in a program override previous ones.
OPTIONS Statement
• Commonly Used Options– CENTER | NOCENTER This option works as a switch. CENTER
centers output on a page, while NOCENTER left justifies output.– DATE | NODATE This option is also a switch. DATE places today’s
date at the top of each page of output.– NUMBER | NONUMBER This switch controls whether a page
number appears on each page of output.– LINESIZE = n Use LINESIZE to control the maximum length of
output lines. Takes values from 64 to 256. A linesize of 98 or less tends to work well for SOCI6200 assignments.
– PAGESIZE = n Use PAGESIZE to control the maximum number of lines per page of output. Takes values from 15 to 32767. A pagesize of 55 or less tends to work well for SOCI6200 assignments.
– PAGENO = n Starts numbering pages with n.– FORMDLIM=’-‘ Asks SAS to write a line of dashes (--) where
normally it would go to a new page, thus possibly conserving paper.
The PROC Step
• SAS procedures are computer programs that:– read SAS data sets– compute statistics– write results– create SAS data sets
• We will examine the following procedures:– SORT Sorts observations by one or more variables.– CONTENTS Provides information about a SAS data set.– PRINT Prints the observations in a SAS data set using some or
all of the variables.– FREQ Produces one-way to n-way frequency and
crosstabulation tables as well as many measures of association.– UNIVARIATE Provides data summarization tools and
information on the distribution of numeric variables.– MEANS Produces descriptive statistics for numeric variables.
Using SAS Procedures
• Using a procedure is like filling out a form. Someone else designed the form, and all you have to do is fill in the blanks and choose from a list of options. – Most procedures write results to the Output window or to a
listing file, depending on how you are running SAS.– You can customize the appearance of the output using the
system options described above.– Many procedures can also write results to an output SAS
data set using an OUTPUT statement or an OUT= option.– Several other statements can be used with almost every
procedure. These optional statements include BY, WHERE, TITLE, FOOTNOTE, LABEL, and FORMAT.
– All procedures can use the Output Delivery System (ODS) to produce results that are more attractive than the default.
Using SAS Procedures
• PROC Statement– Each PROC step begins with a PROC
statement.– A PROC statement consists of the word
PROC followed by the name of the procedure you want to run, followed by any procedure options you want to set for this run.
– The procedure options should almost always include DATA=<data set name> to specify the data set that you want to analyze.
• This example asks PROC PRINT to print the weight_club data set double-spaced.PROC PRINT DATA=weight_club DOUBLE;
BY Statement
• BY Statement– A BY statement tells the procedure to perform a
separate analysis for each combination of values of the BY variables rather than treating all observations as one group.
– All procedures except PROC SORT assume that your data are already sorted by the variables in your BY statement. If they are not, use PROC SORT to do so.
BY Statement
• This example sorts the weight_club data set by team and then computes the average weight loss for each team.
WHERE Statement
• WHERE Statement– The WHERE statement tells the procedure to
use a subset of the data. The basic form of a WHERE statement is
– Only observations satisfying the condition will be used by the PROC. The left side of the condition must be a variable name, and the right side of the condition can be a variable name, a constant, or a mathematical equation.
WHERE Statement: operators
• Here are the most frequently used operators:
WHERE Statement
• For example, this PROC step prints only the members of the yellow team.
TITLE Statement
• TITLE Statement– Use TITLE statements to place descriptive
information at the top of every page of output. The basic form of a TITLE statement is TITLE followed by the text of the title in quotes.
– You can specify up to ten titles by adding numbers to the keyword TITLE. Keyword TITLE is the same as TITLE1.
– The basic form of a TITLE statement is TITLEn followed by the text of the title in quotes.
TITLE Statement
• TITLE statements can appear anywhere in a program, not just in a PROC step, but it often makes sense to put them in PROC steps since they affect procedure output.
• The maximum length of a title is 256 characters or the current linesize, whichever is less.
• Once you specify a title for a line, it is used for all subsequent output until you cancel the title with a null statement or define another title for that line. A null statement for TITLE2 would be the following:
• The flowing cancels all existing titles.
FOOTNOTE Statement
• FOOTNOTE statement• The FOOTNOTE statement works exactly the
same way as the TITLE statement. However, footnotes print at the bottom of the page.
FOOTNOTE ‘your comments’;FOOTNOTE1 ‘your comments’;FOOTNOTE2 ‘your comments’;FOOTNOTE;
LABEL Statement
• LABEL Statement– By default, SAS uses variable names to label your output,
but with the LABEL statement you can create more descriptive labels, up to 256 characters long, for each variable.
– When a LABEL statement is used in a DATA step, the labels are permanently attached to the variables. When used in a PROC step, the labels stay in effect only for the duration of that step.
– Labels containing special characters such as “ or ; must be enclosed in single quotes.
– Labels containing single quotes must be enclosed in double quotes. Otherwise, labels can be enclosed in either single or double quotes.
FORMAT Statement
• FORMAT Statement– Formats can be used to change the appearance
of printed values – how many decimal places to print, to show a $ when a variable contains amounts of money, etc. Base SAS includes many formats for numeric, character, and date values.
– You use a FORMAT statement to associate formats with particular variables. This statement is mentioned here because it is commonly used in PROC steps. However, we will save a detailed discussion of formats for a later class.
The Output Delivery System (ODS)
• The Output Delivery System (ODS) can be used to make your procedure output more visually appealing and more portable to other software packages than is traditional SAS output. You can make HTML files to display on the web, RTF files to import into Microsoft Word, or PDF files to read with Acrobat Reader. All procedures support ODS.
• We will save a detailed discussion of this topic for later in this module.
An example program
• Here is a simple program that uses many of the statements described above, followed by the output.
Output of the example program
The SORT Procedure
• The SORT procedure sorts observations in a SAS data set by one or more variables, storing the sorted observations in a new SAS data set or replacing the original data set.
• Windows and UNIX use the ASCII collating sequence.
• Sorting character variables:
• Sorting numeric variables
The SORT Procedure
• Basic form of the PROC SORT statement:
• Statement used with PROC SORT:
An example program
PROC SORT DATA=SOCI6200.sales OUT=sales ; BY dept DESCENDING clerk ;RUN;PROC PRINT DATA=sales NOOBS ; TITLE 'Sorting a SAS Data Set' ;RUN;
BY Group Processing
• The BY statement can be used with procedures to perform subgroup processing.
• The PROC will be executed separately for each of the subsets of observations defined by the values of the BY variables.
• Syntax:
• The data set must be sorted as specified in the BY variable list.
An example program and output
PROC SORT DATA=SOCI6200.sales OUT=sales ; BY dept ;RUN;PROC MEANS DATA=sales ; BY dept ; TITLE 'PROC MEANS by DEPT' ;RUN;
The CONTENTS Procedure
• PROC CONTENTS prints the descriptor information from a SAS data set.
• It lists data set attributes followed by all variables and their attributes in alphabetical order. Variable attributes displayed include type, length, label (if present), and format (if present).
• PROC CONTENTS is especially useful for documenting a permanent data set since it displays the data set creation date and date of last modification.
• Basic form of the PROC statement:
An example program and output
PROC CONTENTS DATA=weight_club;RUN;
The PRINT Procedure
• The PRINT procedure lists the observations in a SAS data set, using all or some of the variables.
• Features include:– Automatic formatting– Columns labeled with variable names or labels– Special handling of section and page breaks– If requested, accumulation and printing of
subtotals and totals– Special BY/ID formatting
• Basic form of the PROC statement:
The PRINT Procedure
• Selected procedure statement options:– DOUBLE Double-spaces the output.– LABEL Uses variable labels as column headings. If
this option is not specified, variable names are used as column headings.
– N Prints the number of observations at the end of the printed output.
– NOOBS Suppresses the observation number column in the printed output.
• Special statements supported by PROC PRINT:– VAR variable_list ;– ID variable_list ;– PAGEBY name_of_by_variable ;– SUM variable_list ;– SUMBY name_of_by_variable ;
The PRINT Procedure
• VAR Statement– The VAR statement selects the variables to appear in
the report and determines their order.– The basic form of the VAR statement is
VAR variable-list;
• variable-list can be one of the following– VAR x1 x2 x3;– Range of variables in positional order: VAR name--
age ; – Numeric suffix ranges: VAR x1-x3;– Only numeric variables: VAR _numeric_ ;– Only character variables: VAR _character_ ;– All variables starting with the letter A : VAR A: ; – All variables starting with the letters AB: VAR AB: ;
The PRINT Procedure
• ID Statement• If an ID statement is included in a PROC PRINT
step, observations in the listing are identified by the value of the ID variable(s) rather than by observation number.– The basic form of the ID statement isID variable-list;– The variables in the ID statement list appear
before any variables in the VAR statement list.
– Even without an ID statement, the NOOBS option of the PROC PRINT statement can be used to turn off observation numbers.
The PRINT Procedure
• Sample results without an ID statement
PROC PRINT DATA=weight_club; VAR name team loss;RUN;
The PRINT Procedure
• Sample results with an ID statement
The PRINT Procedure
• PAGEBY Statement– The PAGEBY statement can be used to identify a variable
appearing in the BY statement in the same PROC PRINT step. If the value of this BY variable changes, or if the value of any BY variable that precedes it in the BY statement changes, PROC PRINT begins printing a new page.
– The general form of the PAGEBY statement isPAGEBY name_of_by_variable;
• SUM Statement– The SUM statement identifies any numeric variables to
total in the report.– The general form of the SUM statement is
SUM variable(s);
The PRINT Procedure
• SUMBY Statement– The SUMBY statement can be used with the SUM
statement to limit the number of sums that appear in the report. The SUMBY statement can be used to identify a variable appearing in the BY statement in the same PROC PRINT step. If the value of this BY variable changes, or if the value of any BY variable that precedes it in the BY statement changes, PROC PRINT prints the sums of all variables listed in the SUM statement.
– The general form of the SUMBY statement is
SUMBY name_of_by_variable;
PROC PRINT Examples
# of obs
Sum of
Loss
PROC PRINT DATA=weight_club N NOOBS DOUBLE; SUM loss; TITLE 'Listing of the Weight_Club Data Set'; TITLE2 'Using the N, NOOBS, and DOUBLE Options'; TITLE3 'Also, Using a SUM Statement But No VAR Statement';RUN;
Print certain # of observations
PROC PRINT DATA=SOCI6200.sales(OBS=20); VAR _CHARACTER_; TITLE1 'Listing of the First 20 Observations of the SALES Data Set'; TITLE2 'Only Character Variables';RUN;
PROC PRINT example
PROC SORT DATA=SOCI6200.sales OUT=sales; BY dept clerk;RUN;PROC PRINT DATA=sales(OBS=25) LABEL; BY dept clerk; WHERE price > 60; VAR price cost; SUM price; SUMBY dept; LABEL cost='Cost of Item‘ price='Price of Item';TITLE1 'Partial Listing of Sales Data Set';TITLE2 'Using Selected Options and Statements';RUN;
The FREQ Procedure
• The FREQ procedure produces one-way to n-way frequency and crosstabulation tables.– distributions of variable values (frequency counts)– tests for proportions for one-way tables– combined frequencies for two or more variables– weighted frequencies– measures of association and tests (chi-square and
exact) for n-way tables– ability to output frequencies to a SAS data set– stratified analysis, within and across strata
• Basic form of the PROC FREQ statement:PROC FREQ DATA = data_set_name options ;
The FREQ Procedure
• The TABLES statement:TABLES table_requests / options ;
• one-way: tables variable ;• two-way: tables var_1 * var_2 ;• three-way: tables var_1 * var_2 * var_3 ;
– For two- to n-way tables, the last variable is the column variable and the next-to-last variable is the row variable. Other variables can be called stratum variables.
– Any # of requests can be given in one TABLES statement.– Any # of TABLES statements can be used in one PROC
FREQ step.– By default, missing levels of each variable are excluded
from the table, but the total frequency of missing observations is printed below each table.
The FREQ Procedure: some options
NOFREQ suppresses frequenciesNOCUM suppresses cumulative frequencies and percentagesNOPERCENT suppresses percentagesNOROW suppresses row percentagesNOCOL suppresses column percentagesMISSPRINT prints missing value frequencies for two- to n-way tables;MISSING interprets missing values as non-missing and includes them in calculations of % and other statisticsLIST presents two- to n-way tables in list format rather than as crosst tables
OUT= names an output data set to contain variable values and frequency countsOUTPCT adds percentages to OUT= data setSPARSE lists all possible combinations of the variable values for an n-way table even if a combination does not occur in the data; only has an effect with the LIST or OUT=.ALL requests tests and measures of association produced by CHISQ, MEASURES, and CMHCHISQ requests chi-square tests and measures of association based on chi-squareMEASURES requests measures of associationCMH computes Cochran-Mantel-Haenzel statistics
The FREQ Procedure: examples• Basic example
• Example suppressing percents
• Missing data example
PROC FREQ DATA=sales; TABLES dept clerk dept*clerk;RUN;
PROC FREQ DATA=sales; TABLES dept*clerk / NOROW NOCOL NOPERCENT; TITLE1 "Table with Percents Suppressed";RUN;
PROC FREQ DATA=clin ; TABLES evdnk*visit ; TABLES evdnk*visit / MISSPRINT ; TABLES evdnk*visit / MISSING ; TITLE1 "Missing Data Example" ;RUN;
The FREQ Procedure: examples
• Example using the ALL option
• Using WEIGHT statementWEIGHT var-name-of-counts;
PROC FREQ DATA=sales2 ; TABLES sex*clerk / ALL ; TITLE 'Example with the ALL Option';RUN;
The UNIVARIATE Procedure
• The UNIVARIATE procedure provides data summarization tools, high-resolution graphics displays, and information on the distribution of numeric variables.– Descriptive statistics– Details on extreme values and extreme observations– Median, mode, range, and quantiles– Tests for location and normality– Confidence limits– Low-resolution plots to picture the distribution– High-resolution histograms, quantile-quantile plots, and
probability plots– Frequency tables– Output data sets
The UNIVARIATE Procedure
• Basic form of the PROC UNIVARIATE statement:PROC UNIVARIATE options DATA= data_set_name ;
• Selected options:– FREQ produces a frequency table– NOPRINT suppresses all tables of descriptive statistics; use
when producing an output data set– NORMAL computes a test statistic for the hypothesis that
the input data come from a normal distribution– PLOT produces a stem and leaf plot, a box plot, and a
normal probability plot (all lowresolution)– ROUND specifies units for rounding variable values before
computations– CIBASIC requests confidence limits for the mean, standard
deviation, and variance based on normally distributed data
The UNIVARIATE Procedure
• Selected statements used with PROC UNIVARIATE:– BY variable list ;– VAR variable list ;– FREQ variable ; Specifies a numeric variable whose
values represent the frequency of the observation.– WEIGHT variable ; Specifies weight for the
analysis variables.– ID variable list ; Identifies the extreme observations
in the appropriate table.– OUTPUT OUT=sas_data_set output-statistic-list– PCTLPTS=percentiles PCTLPRE=prefix-name-list– PCTLNAMES=suffix-name-list ;
The UNIVARIATE Procedure
• Basic examplesPROC UNIVARIATE DATA=sales2; VAR price; TITLE1 'PROC UNIVARIATE with No Options';RUN;
PROC UNIVARIATE DATA=sales2 FREQ PLOT NORMAL ; VAR COST ; ID CLERK ; LABEL COST='Total Cost' ; TITLE1 'PROC UNIVARIATE with Several Options and Optional Statements';RUN;
The UNIVARIATE Procedure
• Output data set example
PROC SORT DATA=sales2 OUT=sales; BY sex;RUN;PROC UNIVARIATE DATA=sales NOPRINT; BY sex; VAR cost price; OUTPUT OUT=stats N=ncost nprice NMISS=nmcost nmprice MEAN=mcost mprice MAX=maxcost maxprice ;RUN;PROC PRINT DATA=stats; TITLE1 'Output Data Set from PROC UNIVARIATE';RUN;
The MEANS Procedure
• The MEANS procedure produces simple univariate descriptive statistics for numeric variables.– Selected univariate statistics– Special BY group processing: using a BY or CLASS
statement will cause MEANS to calculate– descriptive statistics separately for groups of
observations– Printed output optional– Output data set of summary statistics optional
The MEANS Procedure
• Basic form of the PROC MEANS statement:PROC MEANS DATA= data_set_name options statistic-keywords;
• Selected options:– FW= specifies the field width to use in printing each
statistic; default is 12– NOPRINT suppresses all printed output– MAXDEC= specifies the maximum number of decimal
places– MISSING requests that PROC MEANS treat missing
values as valid subgroup values for the CLASS variables– NWAY specifies that statistics be output for only the
observations with the highest _TYPE_ value
The MEANS Procedure
• Selected statistic-keywords (default N, MEAN, STD, MIN, and MAX):
MEDIAN the median P1 1st percentileP5 5th percentileP10 10th percentileP90 90th percentileP95 95th percentileP99 99th percentileQ1 25th percentileQ3 75th percentileT t testPROBT p-value based on t
N number of non-missing valuesNMISS number of missing valuesMEAN the meanSUM the sum of all values STD standard deviationSTDERR standard errorVAR varianceMAX maximumMIN minimumCLM confidence interval for the mean
The MEANS Procedure
• Selected statements used with PROC MEANS:– BY variable list ;– VAR variable list ;– CLASS variable list / MISSING; MISSING option
requests that missing values be treated as valid class level values.
– FREQ variable ;– WEIGHT variable ;– OUTPUT OUT=sas_data_set output-statistic-list /
AUTONAME;– ID variable list; Can be used to add extra variables
to the output data set.
The MEANS Procedure: examples
• Basic examples
• BY-group processing / the CLASS statement
PROC MEANS DATA=sales2;TITLE1 'PROC MEANS with No Options';RUN;
PROC MEANS DATA=sales2 MEAN MIN MAX ; VAR cost price ; TITLE1 'PROC MEANS: Request Specific Statistics for Selected Variables';RUN;
PROC SORT DATA=sales2 OUT=sales ; BY sex;RUN;PROC MEANS DATA=sales N MEAN; BY sex ;RUN;
PROC MEANS DATA=sales2 N MEAN; CLASS sex ;RUN;
The MEANS Procedure: examples
• Example: Creating an output SAS data set using a BY statement (also AUTONAME)
PROC SORT DATA=sales2; BY sex;RUN;PROC MEANS DATA=sales2 NOPRINT; BY sex; VAR cost price; OUTPUT OUT=summary N= MEAN= / AUTONAME;RUN;PROC PRINT DATA=summary; TITLE1 'Output SAS Data Set from PROC MEANS'; TITLE2 'with the BY Statement (also AUTONAME)';RUN;
The MEANS Procedure: examples
• Example: Creating an output SAS data set using a CLASS statement with and without the NWAY option. Note how the NWAY option affects the output data set.
PROC MEANS DATA=sales2 NOPRINT ; CLASS sex ; VAR cost price ; OUTPUT OUT=summary N=NCost NPrice MEAN=MeanCost MeanPrice ;RUN;
PROC MEANS DATA=sales2 NOPRINT NWAY; CLASS sex ; VAR cost price ; OUTPUT OUT=summary N=NCost NPrice MEAN=MeanCost MeanPrice ;RUN;
SAS Descriptive Procedures: Summary
• SORT– Used to rearrange observations by one or more sort fields.– Must be used to sort data before a BY statement is used with other
procedures.– Can sort data in ascending or descending order.– Sorts data in ASCII sorting sequence.– Can be used to check for and remove duplicate observations.
• CONTENTS– Displays descriptor part of a SAS data set.– Can be used to get meta-information on existing or new SAS data sets.– To see information similar to PROC CONTENTS output, view the properties of
a data set in the SAS Explorer window.
• PRINT– Displays data part of a SAS data set.– Can be used to check a data set or to produce simple listing reports.– Another way to visually inspect the data portion of a data set is to use the
ViewTable window. E.g., double-click on the data set in the SAS Explorer window.
SAS Descriptive Procedures: Summary
• FREQ– Displays the distribution of variable values in tabular format.– Displays the combined frequency for two or more variables.– Displays measures of association and test statistics for two-way tables.– More appropriate for discrete variables.– Should be used to print out all of the possible values a discrete variable
can have and to get a frequency distribution of each discrete variable.
• UNIVARIATE– Gives detailed information about the distribution of a numeric variable.– More appropriate for continuous variables.– Should be run on any variables to be used in more analytic procedures.– Can be used to output statistics of interest.
• MEANS– Provides simple univariate statistics for numeric variables.– More appropriate for continuous variables.– More condensed output than UNIVARIATE.– Can be used to output statistics of interest.