How do I read a file that uses commas, tabs or spaces as delimiters to separate variables in SAS version 8?

Comma-separated files

It is quite easy to read a file that uses a comma as a delimiter using proc import in SAS version 8. There are two slightly different ways of reading a comma delimited file using proc import. In SAS version 8, a comma delimited file can be treated as a special type of external file with the file extension .csv, which stands for comma-separated values. The first sample program makes use of this feature. Let's say we have the following data stored in a file called comma.csv.

AMC,22,3,2930,0,11:11
AMC,17,3,3350,0,11:30
AMC,22,,2640,0,12:34
Audi,17,5,2830,1,13:20
Audi,23,3,2070,1,11:11

Then the following proc import statement will read it in and create a temporary data set called mydata.

proc import datafile="comma.csv" out=mydata dbms=csv replace;
  getnames=no;
run;

proc print data=mydata;
run;

As you can see in the output below, the data were read properly. Also notice that SAS creates default variable names VAR1-VARn when variable names are not present in the raw data file.

Obs    VAR1    VAR2    VAR3    VAR4    VAR5    VAR6
  1    AMC       22       3    2930       0    11:11
  2    AMC       17       3    3350       0    11:30
  3    AMC       22       .    2640       0    12:34
  4    Audi      17       5    2830       1    13:20
  5    Audi      23       3    2070       1    11:11

You might have a file with the variable names on the first line, like the one below. With such a file you would like SAS to use the variable names from the file (e.g., make, mpg, etc.).

make,mpg,rep78,weight,foreign,time
AMC,22,3,2930,0,11:11
AMC,17,3,3350,0,11:30
AMC,22,,2640,0,12:34
Audi,17,5,2830,1,13:20
Audi,23,3,2070,1,11:11

We can use the getnames=yes; statement to tell SAS we want it to read the variable names from the first line of the data file, as illustrated below.

proc import datafile="comma1.csv" out=mydata dbms=csv replace;
  getnames=yes;
run;

proc print data=mydata;
run;

As you can see from the output of the proc print shown below, the data are read correctly.

Obs    make    mpg    rep78    weight    foreign    time
  1    AMC      22        3      2930          0    11:11
  2    AMC      17        3      3350          0    11:30
  3    AMC      22        .      2640          0    12:34
  4    Audi     17        5      2830          1    13:20
  5    Audi     23        3      2070          1    11:11

Another way of reading a comma delimited file is to consider a comma as an ordinary delimiter. Here is a program that shows how to use the dbms=dlm and delimiter="," option to read a file just like we did above. Also notice that the external file doesn't have to have .csv extension.

proc import datafile="comma1.txt" out=mydata dbms=dlm replace;
  delimiter=",";
  getnames=yes;
run;

You may want to create a permanent SAS data file using proc import. Suppose that we want to create a permanent SAS data file called mydata in the directory "c:\dissertation". We can do the following.

libname dis v8 "c:\dissertation";

proc import datafile="comma1.txt" out=dis.mydata dbms=dlm replace;
  delimiter=",";
  getnames=yes;
run;

Another feature of proc import is that you can read the input file starting from a specific row number using the datarow= statement. Let's say that we want to read from observation 4 onward in the text file comma1.txt. Since the variable names occupy the first row of the raw data file, we have to use datarow=5.

proc import datafile="comma1.txt" out=mydata dbms=dlm replace;
  delimiter=",";
  getnames=yes;
  datarow=5;
run;

proc print data=mydata;
run;

Now we can see from the output below that the data have been read correctly.

Obs    make    mpg    rep78    weight    foreign    time
  1    Audi     17        5      2830          1    13:20
  2    Audi     23        3      2070          1    11:11

On the other hand, if our variables don't have names in the raw file, we need to use getnames=no and datarow=4 as shown below.

proc import datafile="comma2.txt" out=mydata dbms=dlm replace;
  delimiter=",";
  getnames=no;
  datarow=4;
run;
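If you need control over variable types, a hand-written DATA step is an alternative to proc import. The sketch below is only one possible version: it assumes the file comma1.csv shown earlier, and the length of make and the time5. informat are assumptions based on the sample values, not something proc import is guaranteed to choose.

```sas
/* A minimal sketch: read comma1.csv with a DATA step instead of proc import. */
/* Assumptions: make fits in 8 characters; time is stored as hh:mm.           */
data mydata;
  infile "comma1.csv" dlm="," dsd missover firstobs=2;  /* skip the header row */
  length make $ 8;                                      /* make is character   */
  input make mpg rep78 weight foreign time :time5.;     /* read time as a SAS time value */
  format time time5.;                                   /* display time as hh:mm */
run;

proc print data=mydata;
run;
```

The dsd option is included so that the empty rep78 field in the third data line is read as a missing value rather than being skipped.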

Tab-delimited files

It is quite easy to read a file that uses a tab as a delimiter using proc import in SAS version 8. There are two slightly different ways of reading a tab delimited file using proc import. In SAS version 8, a tab delimited file can be treated as a special type of external file with the file extension .txt. The first sample program makes use of this feature. Let's say we have the following data stored in a file called tab.txt.

AMC Concord      22    2930    4099
AMC Pacer        17    3350    4749
AMC Spirit       22    2640    3799
Buick Century    22    3250    4816
Buick Electra    15    4080    7827

proc import datafile="tab.txt" out=mydata dbms=tab replace;
  getnames=no;
run;

proc print data=mydata;
run;

As you can see in the output below, the data were read properly. Also notice that SAS creates default variable names VAR1-VARn when variable names are not present in the raw data file.

Obs    VAR1             VAR2    VAR3    VAR4
  1    AMC Concord        22    2930    4099
  2    AMC Pacer          17    3350    4749
  3    AMC Spirit         22    2640    3799
  4    Buick Century      22    3250    4816
  5    Buick Electra      15    4080    7827


MAKE             MPG    WEIGHT    PRICE
AMC Concord       22      2930     4099
AMC Pacer         17      3350     4749
AMC Spirit        22      2640     3799
Buick Century     20      3250     4816
Buick Electra     15      4080     7827


proc import datafile="tab1.txt" out=mydata dbms=tab replace;
  getnames=yes;
run;

proc print data=mydata;
run;


OBS    MAKE             MPG    WEIGHT    PRICE
  1    AMC Concord       22      2930     4099
  2    AMC Pacer         17      3350     4749
  3    AMC Spirit        22      2640     3799
  4    Buick Century     20      3250     4816
  5    Buick Electra     15      4080     7827

Another way of reading a tab delimited file is to consider a tab as an ordinary delimiter. Here is a program that shows how to use the delimiter option to read a file just like we did above.

proc import datafile="tab1.txt" out=mydata dbms=dlm replace;
  delimiter='09'x;
  getnames=yes;
run;

You may want to create a permanent SAS data file using proc import. Suppose that we want to create a permanent SAS data file called mydata in the directory "c:\dissertation". We can do the following.

libname dis v8 "c:\dissertation";

proc import datafile="tab1.txt" out=dis.mydata dbms=dlm replace;
  delimiter='09'x;
  getnames=yes;
run;
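The same tab delimited file can also be read with a DATA step. The sketch below assumes the file tab1.txt with the variable names on its first line; the length of 20 for make is an assumption chosen to fit values such as "Buick Century". Because only the tab ('09'x) is declared as a delimiter, the blank inside each make value is kept as part of the value.

```sas
/* A minimal sketch: read a tab delimited file with a DATA step.        */
/* Assumption: tab1.txt has a header row and four fields per data line. */
data mydata;
  infile "tab1.txt" dlm='09'x dsd missover firstobs=2;  /* tab is the only delimiter */
  length make $ 20;                                     /* room for "Buick Century"  */
  input make mpg weight price;
run;

proc print data=mydata;
run;
```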

Space-delimited files

It is very easy to read a file that uses a space as a delimiter to separate variables using proc import in SAS version 8. Consider the sample data file below, called space.txt.

AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827

Here is a sample program that reads the text file into SAS 8.

proc import datafile="space.txt" out=mydata dbms=dlm replace;
  getnames=no;
run;

Now we can use proc print to see if the data file has been read correctly into SAS 8.

proc print data=mydata;
run;

Obs    VAR1     VAR2    VAR3    VAR4
  1    AMC        22    2930    4099
  2    AMC        17    3350    4749
  3    AMC        22    2640    3799
  4    Buick      20    3250    4816
  5    Buick      15    4080    7827

Notice that we use the getnames=no option because the variables don't have names in the raw data file. SAS 8 will generate the variable names VAR1-VARn. If our raw file has the variable names on the first line, as shown below, then we need to use the option getnames=yes. For example, we have the following text file called space1.txt.

MAKE MPG WEIGHT PRICE
AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827

Then the following program reads the file in with the variable names.

proc import datafile="space1.txt" out=mydata dbms=dlm replace;
  getnames=yes;
run;

What if we want the SAS data set created above to be permanent? Let's say we want to save the permanent file in the directory "c:\dissertation". The answer is to use a libname statement as shown below.

libname dis v8 "c:\dissertation";

proc import datafile="space1.txt" out=dis.mydata dbms=dlm replace;
  getnames=yes;
run;

Another feature of proc import is that you can read the input file starting from a specific row number using the datarow= statement. Let's say that we want to read from observation 3 onward in the text file space1.txt. Since the variable names occupy the first row of the raw data file, we have to use datarow=4.

proc import datafile="space1.txt" out=mydata dbms=dlm replace;
  getnames=yes;
  datarow=4;
run;

proc print data=mydata;
run;

Now we can see from the output below that the data have been read correctly.

Obs    MAKE     MPG    WEIGHT    PRICE
  1    AMC       22      2640     3799
  2    Buick     20      3250     4816
  3    Buick     15      4080     7827

On the other hand, if our variables don't have names in the raw file (as in space.txt), we need to use getnames=no and datarow=3 as shown below.

proc import datafile="space.txt" out=mydata dbms=dlm replace;
  getnames=no;
  datarow=3;
run;

Other kinds of delimiters

You can use delimiter= on the infile statement to tell SAS what delimiter you are using to separate variables in your raw data file. For example, below we have a raw data file that uses exclamation points ! to separate the variables in the file.

22!2930!4099
17!3350!4749
22!2640!3799
20!3250!4816
15!4080!7827

The example below shows how to read this file by using delimiter='!' on the infile statement.

DATA cars;
  INFILE 'readdel1.txt' DELIMITER='!' ;
  INPUT mpg weight price;
RUN;

PROC PRINT DATA=cars;
RUN;

As you can see in the output below, the data was read properly.

OBS    MPG    WEIGHT    PRICE
  1     22      2930     4099
  2     17      3350     4749
  3     22      2640     3799
  4     20      3250     4816
  5     15      4080     7827

It is possible to use multiple delimiters. The example file below uses either exclamation points or plus signs as delimiters.

22!2930!4099
17+3350+4749
22!2640!3799
20+3250+4816
15+4080!7827

By using delimiter='!+' on the infile statement, SAS will recognize both of these as valid delimiters.

DATA cars;
  INFILE 'readdel2.txt' DELIMITER='!+' ;
  INPUT mpg weight price;
RUN;

PROC PRINT DATA=cars;
RUN;

OBS    MPG    WEIGHT    PRICE
  1     22      2930     4099
  2     17      3350     4749
  3     22      2640     3799
  4     20      3250     4816
  5     15      4080     7827
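To try the multiple-delimiter idea without creating an external file, the data can be placed in the program itself. The sketch below is an illustration only: it uses infile datalines in place of readdel2.txt and includes just three of the sample lines.

```sas
/* A minimal sketch: several delimiter characters on in-stream data.      */
/* Assumption: the data lines are typed into the program with datalines.  */
data cars;
  infile datalines dlm='!+';   /* either ! or + separates values */
  input mpg weight price;
  datalines;
22!2930!4099
17+3350+4749
15+4080!7827
;
run;

proc print data=cars;
run;
```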

How do I read a delimited file with missing data in SAS?

It is very convenient to read comma delimited, tab delimited, or other kinds of delimited raw data files. However, you need to be very careful when reading delimited data with missing values. Consider the example raw data file below. Note that the value of mpg is missing for the AMC Pacer and the missing value is signified with two consecutive commas (,,).

AMC Concord,22,2930,4099
AMC Pacer,,3350,4749
AMC Spirit,22,2640,3799
Buick Century,20,3250,4816
Buick Electra,15,4080,7827

We read the file with the program below, using delimiter=',' to indicate that commas are used as delimiters.

DATA cars1;
  length make $ 20 ;
  INFILE 'readdsd.txt' DELIMITER=',' ;
  INPUT make mpg weight price;
RUN;

PROC PRINT DATA=cars1;
RUN;

But, as we see below, the data were read incorrectly for the AMC Pacer.

OBS    MAKE              MPG    WEIGHT    PRICE
  1    AMC Concord        22      2930     4099
  2    AMC Pacer        3350      4749        .
  3    Buick Century      20      3250     4816
  4    Buick Electra      15      4080     7827

SAS does not properly recognize empty values for delimited data unless you use the dsd option. You need to use the dsd option on the infile statement if two consecutive delimiters are used to indicate missing values (e.g., two consecutive commas, two consecutive tabs). Below, we read the exact same file again, except that we use the dsd option.

DATA cars2;
  length make $ 20 ;
  INFILE 'readdsd.txt' DELIMITER=',' DSD ;
  INPUT make mpg weight price;
RUN;

PROC PRINT DATA=cars2;
RUN;

The output is shown below.

OBS    MAKE             MPG    WEIGHT    PRICE
  1    AMC Concord       22      2930     4099
  2    AMC Pacer          .      3350     4749
  3    AMC Spirit        22      2640     3799
  4    Buick Century     20      3250     4816
  5    Buick Electra     15      4080     7827

As you see in the output, the data for the AMC Pacer was read correctly because we used the dsd option.
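One way to check that empty fields were handled as intended is to count missing values after the read. The sketch below is self-contained by assumption: it uses in-stream data (datalines) with three of the sample lines instead of readdsd.txt, and then asks proc means for the number of non-missing (n) and missing (nmiss) values of each numeric variable.

```sas
/* A minimal sketch: read comma delimited data with dsd and count missing values. */
/* Assumption: in-stream data stand in for the external file readdsd.txt.         */
data cars2;
  length make $ 20;
  infile datalines dlm=',' dsd;   /* dsd: two consecutive commas mean a missing value */
  input make mpg weight price;
  datalines;
AMC Concord,22,2930,4099
AMC Pacer,,3350,4749
AMC Spirit,22,2640,3799
;
run;

/* n and nmiss show how many values were read and how many are missing */
proc means data=cars2 n nmiss;
  var mpg weight price;
run;
```

With the dsd option in place, the only missing value reported should be the mpg for the AMC Pacer; if a whole observation disappears instead, the option was probably left off.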

How do I read a delimited file that has embedded delimiters in the data?

Suppose you are reading a comma separated file, but your data contain commas. For example, say your file contains age, name and weight and looks like the one below.

48,'Bill Clinton',210
50,'George Bush, Jr.',180

Say you read this file as you would any other comma delimited file, like the example shown below.

DATA guys1;
  length name $ 20 ;
  INFILE 'readdsd2.txt' DELIMITER=',' ;
  INPUT age name weight ;
RUN;

PROC PRINT DATA=guys1;
RUN;

But, as we see below, the data were not read as we wished. The quotes are treated as data, George Bush lost the ", Jr." from his name, and his weight is missing. This is because SAS treated the comma in George Bush's name as a delimiter indicating the end of the variable, which is not what we wanted.

OBS    NAME              AGE    WEIGHT
  1    'Bill Clinton'     48       210
  2    'George Bush       50         .

Below, we use the dsd option to read the same file.

DATA guys2;
  length name $ 20 ;
  INFILE 'readdsd2.txt' DELIMITER=',' DSD ;
  INPUT age name weight ;
RUN;

PROC PRINT DATA=guys2;
RUN;

As you see in the output below, SAS properly treated the quotes as enclosing each name rather than as data, and it read both Mr. Bush's name and his weight properly.

OBS    NAME                AGE    WEIGHT
  1    Bill Clinton         48       210
  2    George Bush, Jr.     50       180

What are some common options for the infile statement in SAS?

There are a large number of options that you can use on the infile statement. This is a brief summary of commonly used options. You can determine which options you may need by examining your raw data file e.g., in Notepad, Wordpad, using more (on UNIX) or any other command that allows you to view your data.

Let's start with a simple example reading the space delimited file shown below.

22 2930 4099
17 3350 4749
22 2640 3799
20 3250 4816
15 4080 7827

The example program shows how to read the space delimited file shown above.

DATA cars;
  INFILE 'space1.txt' ;
  INPUT mpg weight price;
RUN;

PROC PRINT DATA=cars;
RUN;


OBS    MPG    WEIGHT    PRICE
  1     22      2930     4099
  2     17      3350     4749
  3     22      2640     3799
  4     20      3250     4816
  5     15      4080     7827

Infile options

For more complicated file layouts, refer to the infile options described below.

DLM= The dlm= option can be used to specify the delimiter that separates the variables in your raw data file. For example, dlm=',' indicates that a comma is the delimiter (e.g., a comma separated file, .csv file). Or, dlm='09'x indicates that tabs are used to separate your variables (e.g., a tab separated file).

DSD The dsd option has two functions. First, it recognizes two consecutive delimiters as a missing value. For example, if your file contained the line 20,30,,50 then without dsd SAS would read the values as 20 30 50, but with the dsd option SAS will read them as 20 30 . 50, which is probably what you intended. Second, it allows you to include the delimiter within quoted strings. For example, you would want to use the dsd option if you had a comma separated file and your data included values like "George Bush, Jr.". With the dsd option, SAS will recognize that the comma in "George Bush, Jr." is part of the name, and not a separator indicating a new variable.

FIRSTOBS= This option tells SAS on what line you want it to start reading your raw data file. If the first record(s) contain header information such as variable names, then set firstobs=n where n is the record number where the data actually begin. For example, if you are reading a comma separated file or a tab separated file that has the variable names on the first line, then use firstobs=2 to tell SAS to begin reading at the second line (so it will ignore the first line with the names of the variables).

MISSOVER This option prevents SAS from going to a new input line if it does not find values for all of the variables in the current line of data. For example, you may be reading a space delimited file that is supposed to have 10 values per line, but one of the lines has only 9 values. Without the missover option, SAS will look for the 10th value on the next line of data. If your data are supposed to have only one observation for each line of raw data, then this could cause errors throughout the rest of your data file. If you have a raw data file that has one record per line, this option is a prudent way to keep such errors from cascading through the rest of your data file.

OBS= Indicates which line in your raw data file should be treated as the last record to be read by SAS. This is a good option to use for testing your program. For example, you might use obs=100 to read in just the first 100 lines of data while you are testing your program. When you want to read the entire file, you can remove the obs= option entirely.

A typical infile statement for reading a comma delimited file that contains the variable names in the first line of data would be:

INFILE "test.txt" DLM=',' DSD MISSOVER FIRSTOBS=2 ;
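To show where such an infile statement sits in a program, here is a sketch of a complete DATA step. The variable names id, name and score are hypothetical, since the layout of test.txt is not given here, and the obs=101 option is added only for testing: together with firstobs=2 it stops after line 101 of the file, that is, after the header line plus 100 data lines.

```sas
/* A minimal sketch: a typical infile statement inside a full DATA step. */
/* Assumptions: test.txt holds three fields per line (id, name, score);  */
/* these variable names are hypothetical.                                */
data test;
  infile "test.txt" dlm=',' dsd missover firstobs=2 obs=101;
  length name $ 20;          /* name is a character variable */
  input id name score;
run;

proc print data=test;
run;
```

Once the program behaves as expected, remove obs=101 to read the whole file.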

How do I read raw data files compressed with gzip (.gz files) in SAS?

Please note: This FAQ is specific to reading files in a UNIX environment, and may not work in all UNIX environments.

It can be very efficient to store large raw data files compressed with gzip (as .gz files). Such files often are 20 times smaller than the original raw data file. For example, a raw data file that would take 200 megabytes could be compressed to be as small as 10 megabytes. Let's illustrate how to read a compressed file with a small example. Consider the data file shown below.

AMC Concord   220 2930 4099
AMC Pacer     170 3350 4749
AMC Spirit    220 2640 3799
Buick Century 200 3250 4816
Buick Electra 150 4080 7827

If this were a raw data file called rawdata.txt we could read it using a SAS program like the one shown below.

FILENAME in "rawdata.txt" ;

DATA test;
  INFILE in ;
  INPUT make $ 1-14 mpg 15-18 weight 19-23 price 24-27 ;
RUN;

On most UNIX computers (e.g., Nicco, Aristotle) you could compress rawdata.txt by typing

gzip rawdata.txt &

and this would create a compressed version named rawdata.txt.gz. To read this file into SAS, normally you would first uncompress the file, and then read the uncompressed version into SAS. Uncompressing the file can be very time consuming and can consume a great deal of disk space. Instead, you can read the compressed file rawdata.txt.gz directly within SAS without having to first uncompress it. SAS can uncompress the file "on the fly" and never create a separate uncompressed version of the file. On most UNIX computers (e.g., Nicco, Aristotle) you could read the file with a program like this.

FILENAME in PIPE "gzip -dc rawdata.txt.gz" LRECL=80 ;

DATA test;
  INFILE in ;
  INPUT make $ 1-14 mpg 15-18 weight 19-23 price 24-27 ;
RUN;

In your program, be sure to change the lrecl=80 to be the width of your raw data file (the width of the longest line of data). If you are unsure of how wide the file is, just use a value that is certainly wider than the widest line of your file.

You would most likely use this technique when you are reading a very large file. You can test your program by just reading a handful of observations by using the obs= parameter on the infile statement, e.g., infile in obs=20; would read just the first 20 observations from your file.

How do I read raw data via FTP in SAS?

SAS has the ability to read raw data directly from FTP servers. Normally, you would use FTP to download the data to your local computer and then use SAS to read the data stored on your local computer. SAS allows you to bypass the download step and read the data directly from the other computer via FTP. Of course, this assumes that you can reach the computer via the internet at the time you run your SAS program. The program below illustrates how to do this. After the fileref in on the filename statement, you put ftp to tell SAS to access the data via FTP. After that, you supply the name of the file (in this case 'gpa.txt').
lrecl= is used to specify the width of your data. Be sure to choose a value that is at least as wide as your widest record.
cd= is used to specify the directory where the file is stored.
host= is used to specify the name of the site to which you want to FTP.
user= is used to provide your userid (or anonymous if connecting via anonymous FTP).
pass= is used to supply your password (or your email address if connecting via anonymous FTP).

FILENAME in FTP 'gpa.txt' LRECL=80
         CD='/local2/samples/sas/ats/'
         HOST='cluster.oac.ucla.edu'
         USER='joebruin'
         PASS='yourpassword' ;

DATA gpa ;
  INFILE in ;
  INPUT gpa hsm hss hse satm satv gender ;
RUN;

PROC PRINT DATA=gpa(obs=10) ;
RUN;

As you see below, the program read the data in gpa.txt successfully.

OBS     GPA    HSM    HSS    HSE    SATM    SATV    GENDER
  1    5.32     10     10     10     670     600         1
  2    5.14      9      9     10     630     700         2
  3    3.84      9      6      6     610     390         1
  4    5.34     10      9      9     570     530         2
  5    4.26      6      8      5     700     640         1
  6    4.35      8      6      8     640     530         1
  7    5.33      9      7      9     630     560         2
  8    4.85     10      8      8     610     460         2
  9    4.76     10     10     10     570     570         2
 10    5.72      7      8      7     550     500         1

The log shows that we read 40 records and 7 variables, confirming that we read the data correctly. Since it is possible you could lose your FTP connection and only get part of the data, it is extra important to check the log to see how many observations and variables you read, and to compare that to how many observations and variables you believe the file to have.

NOTE: 40 records were read from the infile IN.
      The minimum record length was 25.
      The maximum record length was 25.
NOTE: The data set WORK.GPA has 40 observations and 7 variables.

In your program, be sure to change the lrecl=80 to be the width of your raw data file. If you are unsure of how wide the file is, just use a value that is certainly wider than the widest line of your file. You would most likely use this technique when you are reading a very large file. You can test your program by just reading a handful of observations by using the obs= parameter on the infile statement, e.g., infile in obs=20; would read just the first 20 observations from your file.

How can I input multiple raw data files in SAS?

To input multiple raw data files into SAS, you can use the filename statement. For example, suppose that we have four raw data files containing the sales information for a small company, one file for each quarter of a year. Each file has the same variables, and these variables are in the same order in each raw data set. On the filename statement, we first provide a name (a fileref) for the files; in this example, we use the name year. Next, in parentheses, we list each of the data files to be included. You can list as many files as you like on the filename statement. In the data step, we use the infile statement and give the name that we used on the filename statement. We use the input statement to list the names of the variables.

First, let's see what the raw data files look like.

quarter1.dat

1 120321 1236 154669 211326
1 326264 1326 163354 312665
1 420698 1327 142336 422685
1 211368 1236 156327 655237
1 378596 1429 145678 366578

quarter2.dat

2 140362 1436 114641 362415
2 157956 1327 124869 345215
2 215547 1472 165578 412567
2 204782 1495 150479 364474
2 232571 1345 135467 332567

quarter3.dat

3 140357 1339 142693 205881
3 149964 1420 152367 223795
3 159852 1479 160001 254874
3 139957 1527 163567 263088
3 150047 1602 175561 277552

quarter4.dat

4 479574 1367 155997 36134
4 496207 1459 140396 35941
4 501156 1598 135489 39640
4 532982 1601 143269 38695
4 563222 1625 147889 39556

filename year ('d:\quarter1.dat' 'd:\quarter2.dat' 'd:\quarter3.dat' 'd:\quarter4.dat');

data temp;
  infile year;
  input quarter sales tax expenses payroll;
run;

proc print data = temp;
run;

Obs    quarter     sales     tax    expenses    payroll
  1       1       120321    1236     154669      211326
  2       1       326264    1326     163354      312665
  3       1       420698    1327     142336      422685
  4       1       211368    1236     156327      655237
  5       1       378596    1429     145678      366578
  6       2       140362    1436     114641      362415
  7       2       157956    1327     124869      345215
  8       2       215547    1472     165578      412567
  9       2       204782    1495     150479      364474
 10       2       232571    1345     135467      332567
 11       3       140357    1339     142693      205881
 12       3       149964    1420     152367      223795
 13       3       159852    1479     160001      254874
 14       3       139957    1527     163567      263088
 15       3       150047    1602     175561      277552
 16       4       479574    1367     155997       36134
 17       4       496207    1459     140396       35941
 18       4       501156    1598     135489       39640
 19       4       532982    1601     143269       38695
 20       4       563222    1625     147889       39556
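When several files are read through one fileref, it can be useful to record which file each observation came from. One possible approach, sketched below, uses the filename= option on the infile statement. The variable names srcfile and source and the length of 200 are assumptions made for this illustration; the file paths are the ones used in the example.

```sas
/* A minimal sketch: keep the name of the source file for each observation. */
/* Assumptions: srcfile and source are hypothetical variable names;         */
/* 200 characters is assumed to be long enough for the file paths.          */
filename year ('d:\quarter1.dat' 'd:\quarter2.dat' 'd:\quarter3.dat' 'd:\quarter4.dat');

data temp;
  length srcfile source $ 200;
  infile year filename=srcfile;   /* srcfile holds the name of the file being read */
  input quarter sales tax expenses payroll;
  source = srcfile;               /* copy it, because the filename= variable is not saved */
run;

proc print data=temp;
run;
```

The copy into source is needed because the variable named on filename= is not written to the output data set.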

How do I read in a file that uses commas, tabs or spaces as delimiters to separate variables in SAS?

Note: This page is done using SAS version 9.1.3

Comma-separated files

It is quite easy to read a file that uses a comma as a delimiter using proc import in SAS. There are two slightly different ways of reading a comma delimited file using proc import. In SAS, a comma delimited file can be treated as a special type of external file with the file extension .csv, which stands for comma-separated values. The first sample program makes use of this feature. Let's say we have the following data stored in a file called comma.csv.

AMC,22,3,2930,0,11:11
AMC,17,3,3350,0,11:30
AMC,22,,2640,0,12:34
Audi,17,5,2830,1,13:20
Audi,23,3,2070,1,11:11


proc import datafile="comma.csv" out=mydata dbms=csv replace;
  getnames=no;
run;

proc print data=mydata;
run;

As you can see in the output below, the data were read properly. Also notice that SAS creates default variable names VAR1-VARn when variable names are not present in the raw data file.

Obs    VAR1    VAR2    VAR3    VAR4    VAR5    VAR6
  1    AMC       22       3    2930       0    11:11
  2    AMC       17       3    3350       0    11:30
  3    AMC       22       .    2640       0    12:34
  4    Audi      17       5    2830       1    13:20
  5    Audi      23       3    2070       1    11:11


make,mpg,rep78,weight,foreign,time
AMC,22,3,2930,0,11:11
AMC,17,3,3350,0,11:30
AMC,22,,2640,0,12:34
Audi,17,5,2830,1,13:20
Audi,23,3,2070,1,11:11


proc import datafile="comma1.csv" out=mydata dbms=csv replace;
  getnames=yes;
run;

proc print data=mydata;
run;


Obs    make    mpg    rep78    weight    foreign    time
  1    AMC      22        3      2930          0    11:11
  2    AMC      17        3      3350          0    11:30
  3    AMC      22        .      2640          0    12:34
  4    Audi     17        5      2830          1    13:20
  5    Audi     23        3      2070          1    11:11

Another way of reading a comma delimited file is to consider a comma as an ordinary delimiter. Here is a program that shows how to use the dbms=dlm and delimiter="," option to read a file just like we did above. Also notice that the external file doesn't have to have .csv extension.

proc import datafile="comma1.txt" out=mydata dbms=dlm replace;
  delimiter=",";
  getnames=yes;
run;

You may want to create a permanent SAS data file using proc import. Suppose that we want to create a permanent SAS data file called mydata in the directory "c:\dissertation". We can do the following.

libname dis "c:\dissertation";

proc import datafile="comma1.txt" out=dis.mydata dbms=dlm replace;
  delimiter=",";
  getnames=yes;
run;

Another feature of proc import is that you can read the input file starting from a specific row number using the datarow= statement. Let's say that we want to read from observation 4 onward in the text file comma1.txt. Since the variable names occupy the first row of the raw data file, we have to use datarow=5.

proc import datafile="comma1.txt" out=mydata dbms=dlm replace;
  delimiter=",";
  getnames=yes;
  datarow=5;
run;

proc print data=mydata;
run;


Obs    make    mpg    rep78    weight    foreign    time
  1    Audi     17        5      2830          1    13:20
  2    Audi     23        3      2070          1    11:11

On the other hand, if our variables don't have names in the raw file, we need to use getnames=no and datarow=4 as shown below.

proc import datafile="comma2.txt" out=mydata dbms=dlm replace;
  delimiter=",";
  getnames=no;
  datarow=4;
run;

Tab-delimited files

It is quite easy to read a file that uses a tab as a delimiter using proc import in SAS. There are two slightly different ways of reading a tab delimited file using proc import. In SAS, a tab delimited file can be treated as a special type of external file with the file extension .txt. The first sample program makes use of this feature. Let's say we have the following data stored in a file called tab.txt.

AMC Concord      22    2930    4099
AMC Pacer        17    3350    4749
AMC Spirit       22    2640    3799
Buick Century    22    3250    4816
Buick Electra    15    4080    7827


proc import datafile="tab.txt" out=mydata dbms=tab replace;
  getnames=no;
run;

proc print data=mydata;
run;

As you can see in the output below, the data were read properly. Also notice that SAS creates default variable names VAR1-VARn when variable names are not present in the raw data file.

Obs    VAR1             VAR2    VAR3    VAR4
  1    AMC Concord        22    2930    4099
  2    AMC Pacer          17    3350    4749
  3    AMC Spirit         22    2640    3799
  4    Buick Century      22    3250    4816
  5    Buick Electra      15    4080    7827


MAKE             MPG    WEIGHT    PRICE
AMC Concord       22      2930     4099
AMC Pacer         17      3350     4749
AMC Spirit        22      2640     3799
Buick Century     20      3250     4816
Buick Electra     15      4080     7827


proc import datafile="tab1.txt" out=mydata dbms=tab replace;
  getnames=yes;
run;

proc print data=mydata;
run;


OBS    MAKE             MPG    WEIGHT    PRICE
  1    AMC Concord       22      2930     4099
  2    AMC Pacer         17      3350     4749
  3    AMC Spirit        22      2640     3799
  4    Buick Century     20      3250     4816
  5    Buick Electra     15      4080     7827

Another way of reading a tab delimited file is to consider a tab as an ordinary delimiter. Here is a program that shows how to use the delimiter option to read a file just like we did above.

proc import datafile="tab1.txt" out=mydata dbms=dlm replace;
  delimiter='09'x;
  getnames=yes;
run;

You may want to create a permanent SAS data file using proc import. Suppose that we want to create a permanent SAS data file called mydata in the directory "c:\dissertation". We can do the following.

libname dis "c:\dissertation";

proc import datafile="tab1.txt" out=dis.mydata dbms=dlm replace;
  delimiter='09'x;
  getnames=yes;
run;

Space-delimited files

It is very easy to read a file that uses a space as a delimiter to separate variables using proc import in SAS. Consider the sample data file below, called space.txt.

AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827

Here is a sample program that reads the text file into SAS.

proc import datafile="space.txt" out=mydata dbms=dlm replace;
  getnames=no;
run;

Now we can use proc print to see if the data file has been read correctly into SAS.

proc print data=mydata;
run;

Obs    VAR1     VAR2    VAR3    VAR4
  1    AMC        22    2930    4099
  2    AMC        17    3350    4749
  3    AMC        22    2640    3799
  4    Buick      20    3250    4816
  5    Buick      15    4080    7827

Notice that we use the getnames=no option because the variables don't have names in the raw data file. SAS will generate the variable names VAR1-VARn. If our raw file has the variable names on the first line, as shown below, then we need to use the option getnames=yes. For example, we have the following text file called space1.txt.

MAKE MPG WEIGHT PRICE
AMC 22 2930 4099
AMC 17 3350 4749
AMC 22 2640 3799
Buick 20 3250 4816
Buick 15 4080 7827

Then the following program reads the file in with the variable names.

proc import datafile="space1.txt" out=mydata dbms=dlm replace;
  getnames=yes;
run;

What if we want the SAS data set created above to be permanent? Let's say we want to save the permanent file in the directory "c:\dissertation". The answer is to use a libname statement as shown below.

libname dis "c:\dissertation";

proc import datafile="space1.txt" out=dis.mydata dbms=dlm replace;
  getnames=yes;
run;

Another feature of proc import is that you can read the input file starting from a specific row number using the datarow= statement. Let's say that we want to read from observation 3 onward in the text file space1.txt. Since the variable names occupy the first row of the raw data file, we have to use datarow=4.

proc import datafile="space1.txt" out=mydata dbms=dlm replace;
  getnames=yes;
  datarow=4;
run;

proc print data=mydata;
run;


Obs    MAKE     MPG    WEIGHT    PRICE
  1    AMC       22      2640     3799
  2    Buick     20      3250     4816
  3    Buick     15      4080     7827

On the other hand, if our variables don't have names in the raw file (as in space.txt), we need to use getnames=no and datarow=3 as shown below.

proc import datafile="space.txt" out=mydata dbms=dlm replace;
  getnames=no;
  datarow=3;
run;

Other kinds of delimiters

You can use delimiter= on the infile statement to tell SAS what delimiter you are using to separate variables in your raw data file. For example, below we have a raw data file that uses exclamation points ! to separate the variables in the file.

22!2930!4099
17!3350!4749
22!2640!3799
20!3250!4816
15!4080!7827

The example below shows how to read this file by using delimiter='!' on the infile statement.

DATA cars;
  INFILE 'readdel1.txt' DELIMITER='!' ;
  INPUT mpg weight price;
RUN;

PROC PRINT DATA=cars;
RUN;

OBS    MPG    WEIGHT    PRICE
  1     22      2930     4099
  2     17      3350     4749
  3     22      2640     3799
  4     20      3250     4816
  5     15      4080     7827

It is possible to use multiple delimiters. The example file below uses either exclamation points or plus signs as delimiters.

22!2930!4099
17+3350+4749
22!2640!3799
20+3250+4816
15+4080!7827

By using delimiter='!+' on the infile statement, SAS will recognize both of these as valid delimiters.

DATA cars;
  INFILE 'readdel2.txt' DELIMITER='!+' ;
  INPUT mpg weight price;
RUN;

PROC PRINT DATA=cars;
RUN;

OBS    MPG    WEIGHT    PRICE
  1     22      2930     4099
  2     17      3350     4749
  3     22      2640     3799
  4     20      3250     4816
  5     15      4080     7827

How do I read/write Excel files in SAS?

Reading an Excel file into SAS

Suppose that you have an Excel spreadsheet called auto.xls. The data for this spreadsheet are shown below.

MAKE             MPG    WEIGHT    PRICE
AMC Concord       22     2930      4099
AMC Pacer         17     3350      4749
AMC Spirit        22     2640      3799
Buick Century     20     3250      4816
Buick Electra     15     4080      7827

Using the Import Wizard is an easy way to import data into SAS. The Import Wizard can be found on the File menu. Although the Import Wizard is easy to use, it can be time consuming if used repeatedly. The very last screen of the Import Wizard gives you the option to save the statements SAS uses to import the data so that they can be used again. The following is an example that uses common options and also shows that the file was imported correctly.

PROC IMPORT OUT= WORK.auto1
  DATAFILE= "C:\auto.xls"
  DBMS=EXCEL REPLACE;
  SHEET="auto1";
  GETNAMES=YES;
  MIXED=YES;
  USEDATE=YES;
  SCANTIME=YES;
RUN;

proc print data=auto1;
run;

Obs    MAKE             MPG    WEIGHT    PRICE
 1     AMC Concord       22     2930      4099
 2     AMC Pacer         17     3350      4749
 3     Amc Spirit        22     2640      3799
 4     Buick Century     20     3250      4816
 5     Buick Electra     15     4080      7827

First we use the out= statement to tell SAS where to store the data once they are imported.

Next the datafile= statement tells SAS where to find the file we want to import.

The dbms= statement is used to identify the type of file being imported. This statement is redundant if the file you want to import already has an appropriate file extension, for example *.xls.

The replace statement will overwrite an existing file. To specify which sheet SAS should import, use the sheet="sheetname" statement. The default is for SAS to read the first sheet. Note that sheet names can only be 31 characters long.

The getnames=yes is the default setting and SAS will automatically use the first row of data as variable names. If the first row of your sheet does not contain variable names use the getnames=no.

SAS uses the first eight rows of data to determine whether the variable should be read as character or numeric. The default setting mixed=no assumes that each variable is either all character or all numeric. If you have a variable with both character and numeric values or a variable with missing values use mixed=yes statement to be sure SAS will read it correctly.

Conveniently SAS reads date, time and datetime formats. The usedate=yes is the default statement and SAS will read date or time formatted data as a date. When usedate=no SAS will read date and time formatted data with a datetime format. Keep the default statement scantime=yes to read in time formatted data as long as the variable does not also contain a date format.

Example 1: Making a permanent data file

What if you want the SAS data set created from proc import to be permanent? The answer is to use libname statement. Let's say that we have an Excel file called auto.xls in directory "d:\temp" and we want to convert it into a SAS data file (call it myauto) and put it into the directory "c:\dissertation". Here is what we can do.

libname dis "c:\dissertation";
proc import datafile="d:\temp\auto.xls" out=dis.myauto replace;
run;

Example 2: Reading in a specific sheet

Sometimes you may only want to read a particular sheet from an Excel file instead of the entire Excel file. Let's say that we have a two-sheet Excel file called auto2.xls. The example below shows how to use the option sheet=sheetname to read the second sheet called page2 in it.

proc import datafile="auto2.xls" out=auto1 replace;
  sheet="page2";
run;

Example 3: Reading a file without variable names

What if the variables in your Excel file do not have variable names? The answer here is to use the statement getnames=no in proc import. Here is an example showing how to do this.

proc import datafile="a:\faq\auto.xls" out=auto replace;
  getnames=no;
run;

Writing Excel files out from SAS

It is very easy to write out an Excel file using proc export in SAS version 8. Consider the sample data file below.

Obs    MAKE     MPG    WEIGHT    PRICE
 1     AMC       22     2930      4099
 2     AMC       17     3350      4749
 3     AMC       22     2640      3799
 4     Buick     20     3250      4816
 5     Buick     15     4080      7827

Here is a sample program that writes out an Excel file called mydata.xls into the directory "c:\dissertation".

proc export data=mydata outfile='c:\dissertation\mydata.xls' replace;
run;
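The same procedure can also write a comma-separated file, which mirrors the proc import examples earlier in this document. The sketch below assumes the mydata data set shown above; the output file name mydata.csv is made up for illustration.

```sas
/* Sketch: write mydata as a comma-separated file rather than Excel. */
/* The file name mydata.csv is an assumption for this example.       */
proc export data=mydata
  outfile='c:\dissertation\mydata.csv'
  dbms=csv
  replace;
run;
```

Naming dbms=csv explicitly avoids relying on the file extension to pick the output format.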

SAS FAQ How do I use keep and drop efficiently?

This module demonstrates how to select variables, using the keep and drop statements, more efficiently. Sometimes data files contain information that is superfluous to a particular analysis, in which case we might want to change the data file to contain only the variables of interest. Programs will run more quickly and occupy less storage space if files contain only necessary variables, and you can use the keep and drop statements in such a way as to make your program run more efficiently. The following program builds a SAS file called auto.

DATA auto ;
  LENGTH make $ 20 ;
  /* make is read from columns 1-17, so price must start in column 18 or later */
  INPUT make $ 1-17 price mpg rep78 hdroom trunk weight length turn displ gratio foreign ;
CARDS;
AMC Concord        4099 22 3 2.5 11 2930 186 40 121 3.58 0
AMC Pacer          4749 17 3 3.0 11 3350 173 40 258 2.53 0
AMC Spirit         3799 22 . 3.0 12 2640 168 35 121 3.08 0
Audi 5000          9690 17 5 3.0 15 2830 189 37 131 3.20 1
Audi Fox           6295 23 3 2.5 11 2070 174 36 97 3.70 1
BMW 320i           9735 25 4 2.5 12 2650 177 34 121 3.64 1
Buick Century      4816 20 3 4.5 16 3250 196 40 196 2.93 0
Buick Electra      7827 15 4 4.0 20 4080 222 43 350 2.41 0
Buick LeSabre      5788 18 3 4.0 21 3670 218 43 231 2.73 0
Buick Opel         4453 26 . 3.0 10 2230 170 34 304 2.87 0
Buick Regal        5189 20 3 2.0 16 3280 200 42 196 2.93 0
Buick Riviera      10372 16 3 3.5 17 3880 207 43 231 2.93 0
Buick Skylark      4082 19 3 3.5 13 3400 200 42 231 3.08 0
Cad. Deville       11385 14 3 4.0 20 4330 221 44 425 2.28 0
Cad. Eldorado      14500 14 2 3.5 16 3900 204 43 350 2.19 0
Cad. Seville       15906 21 3 3.0 13 4290 204 45 350 2.24 0
Chev. Chevette     3299 29 3 2.5 9 2110 163 34 231 2.93 0
Chev. Impala       5705 16 4 4.0 20 3690 212 43 250 2.56 0
Chev. Malibu       4504 22 3 3.5 17 3180 193 31 200 2.73 0
Chev. Monte Carlo  5104 22 2 2.0 16 3220 200 41 200 2.73 0
Chev. Monza        3667 24 2 2.0 7 2750 179 40 151 2.73 0
Chev. Nova         3955 19 3 3.5 13 3430 197 43 250 2.56 0
Datsun 200         6229 23 4 1.5 6 2370 170 35 119 3.89 1
Datsun 210         4589 35 5 2.0 8 2020 165 32 85 3.70 1
Datsun 510         5079 24 4 2.5 8 2280 170 34 119 3.54 1
Datsun 810         8129 21 4 2.5 8 2750 184 38 146 3.55 1
Dodge Colt         3984 30 5 2.0 8 2120 163 35 98 3.54 0
Dodge Diplomat     4010 18 2 4.0 17 3600 206 46 318 2.47 0
Dodge Magnum       5886 16 2 4.0 17 3600 206 46 318 2.47 0
Dodge St. Regis    6342 17 2 4.5 21 3740 220 46 225 2.94 0
Fiat Strada        4296 21 3 2.5 16 2130 161 36 105 3.37 1
Ford Fiesta        4389 28 4 1.5 9 1800 147 33 98 3.15 0
Ford Mustang       4187 21 3 2.0 10 2650 179 43 140 3.08 0
Honda Accord       5799 25 5 3.0 10 2240 172 36 107 3.05 1
Honda Civic        4499 28 4 2.5 5 1760 149 34 91 3.30 1
Linc. Continental  11497 12 3 3.5 22 4840 233 51 400 2.47 0
Linc. Mark V       13594 12 3 2.5 18 4720 230 48 400 2.47 0
Linc. Versailles   13466 14 3 3.5 15 3830 201 41 302 2.47 0
Mazda GLC          3995 30 4 3.5 11 1980 154 33 86 3.73 1
Merc. Bobcat       3829 22 4 3.0 9 2580 169 39 140 2.73 0
Merc. Cougar       5379 14 4 3.5 16 4060 221 48 302 2.75 0
Merc. Marquis      6165 15 3 3.5 23 3720 212 44 302 2.26 0
Merc. Monarch      4516 18 3 3.0 15 3370 198 41 250 2.43 0
Merc. XR-7         6303 14 4 3.0 16 4130 217 45 302 2.75 0
Merc. Zephyr       3291 20 3 3.5 17 2830 195 43 140 3.08 0
Olds 98            8814 21 4 4.0 20 4060 220 43 350 2.41 0
Olds Cutl Supr     5172 19 3 2.0 16 3310 198 42 231 2.93 0
Olds Cutlass       4733 19 3 4.5 16 3300 198 42 231 2.93 0
Olds Delta 88      4890 18 4 4.0 20 3690 218 42 231 2.73 0
Olds Omega         4181 19 3 4.5 14 3370 200 43 231 3.08 0
Olds Starfire      4195 24 1 2.0 10 2730 180 40 151 2.73 0
Olds Toronado      10371 16 3 3.5 17 4030 206 43 350 2.41 0
Peugeot 604        12990 14 . 3.5 14 3420 192 38 163 3.58 1
Plym. Arrow        4647 28 3 2.0 11 3260 170 37 156 3.05 0
Plym. Champ        4425 34 5 2.5 11 1800 157 37 86 2.97 0
Plym. Horizon      4482 25 3 4.0 17 2200 165 36 105 3.37 0
Plym. Sapporo      6486 26 . 1.5 8 2520 182 38 119 3.54 0
Plym. Volare       4060 18 2 5.0 16 3330 201 44 225 3.23 0
Pont. Catalina     5798 18 4 4.0 20 3700 214 42 231 2.73 0
Pont. Firebird     4934 18 1 1.5 7 3470 198 42 231 3.08 0
Pont. Grand Prix   5222 19 3 2.0 16 3210 201 45 231 2.93 0
Pont. Le Mans      4723 19 3 3.5 17 3200 199 40 231 2.93 0
Pont. Phoenix      4424 19 . 3.5 13 3420 203 43 231 3.08 0
Pont. Sunbird      4172 24 2 2.0 7 2690 179 41 151 2.73 0
Renault Le Car     3895 26 3 3.0 10 1830 142 34 79 3.72 1
Subaru             3798 35 5 2.5 11 2050 164 36 97 3.81 1
Toyota Celica      5899 18 5 2.5 14 2410 174 36 134 3.06 1
Toyota Corolla     3748 31 5 3.0 9 2200 165 35 97 3.21 1
Toyota Corona      5719 18 5 2.0 11 2670 175 36 134 3.05 1
Volvo 260          11995 17 5 2.5 14 3170 193 37 163 2.98 1
VW Dasher          7140 23 4 2.5 12 2160 172 36 97 3.74 1
VW Diesel          5397 41 5 3.0 15 2040 155 35 90 3.78 1
VW Rabbit          4697 25 4 3.0 15 1930 155 35 89 3.78 1
VW Scirocco        6850 25 4 2.0 16 1990 156 36 97 3.78 1
;
RUN;

PROC CONTENTS DATA=auto;
RUN;

The proc contents shown below provides information about the file.

CONTENTS PROCEDURE

Data Set Name: WORK.AUTO     Observations: 74
Member Type:   DATA          Variables:    12

-----Alphabetic List of Variables and Attributes-----

 #    Variable    Type    Len    Pos
------------------------------------
10    DISPL       Num       8     84
12    FOREIGN     Num       8    100
11    GRATIO      Num       8     92
 5    HDROOM      Num       8     44
 8    LENGTH      Num       8     68
 1    MAKE        Char     20      0
 3    MPG         Num       8     28
 2    PRICE       Num       8     20
 4    REP78       Num       8     36
 6    TRUNK       Num       8     52
 9    TURN        Num       8     76
 7    WEIGHT      Num       8     60

If, for example, we wanted to examine the relationship between mpg and price for various makes, but had no interest in the automobile's dimensions, we could create a smaller file, by keeping only these three variables.

DATA auto2;
  set auto;
  keep make mpg price;
RUN;

To verify the contents of the new file, run the following program.

PROC CONTENTS DATA=AUTO2;
RUN;

CONTENTS PROCEDURE

Data Set Name: WORK.AUTO2    Observations: 74
Member Type:   DATA          Variables:    3

-----Alphabetic List of Variables and Attributes-----

 #    Variable    Type    Len    Pos
-----------------------------------
 1    MAKE        Char     20      0
 3    MPG         Num       8     28
 2    PRICE       Num       8     20

Note that the number of observations, or records, remains unchanged. This program creates auto2 from the original file auto. The new file, named auto2 is identical to auto except that it contains only the variables listed in the keep statement.

SAS will read into working memory all the variables on the auto file, deleting the unwanted variables only when it writes out the new file auto2. This means that all the variables on the input file are available for SAS to use during the program. However, it also means that SAS will be working with a larger data set than may be necessary. An alternate way to control the selection of variables is to use SAS data set options, which specifically control the way variables are read from SAS files and/or written out to SAS files, resulting in more efficient use of computer resources.

The following program creates exactly the same file, but is a more efficient program because SAS only reads the desired variables.

DATA auto2;
  SET auto (KEEP = make mpg price);
RUN;

The drop data set option works in a similar way.

DATA AUTO2;
  SET auto (DROP = rep78 hdroom trunk weight length turn displ gratio foreign);
RUN;

The keep data set option can also control which variables are written to the new file.

DATA AUTO2 (keep = make mpg price);
  SET auto;
RUN;

Or, we can use the drop data set option.

DATA AUTO2 (drop = rep78 hdroom trunk weight length turn displ gratio foreign);
  SET auto;
RUN;

In these two examples, all the variables in the auto file are read into working memory. SAS does not, however, include them when it writes out the new file auto2.

The data set option controls the contents of the file whose name it follows in parentheses. If it modifies the file on the set statement (the file being read), it determines which variables are read. If it modifies the file on the data statement (the file being written), it controls which variables are written to the new file.

Data set options may be used on both files, as illustrated in the following program.

DATA AUTO2 (drop=weight length);
  SET auto (keep=weight length);
  size = weight * length;
run;

In this example, SAS reads two variables (weight and length) into working memory, using them to compute a new variable (size). Since weight and length are dropped on the output file, auto2 contains only 1 variable (size).

Be careful not to eliminate, with a keep or drop on the input file, any variable that you refer to later in the data step. A variable that is not read in is not available to the program.
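The warning above can be illustrated with a small sketch. The data set names auto_bad and auto_ok and the variable ratio are made up for this example; the auto file is the one built earlier in this module.

```sas
/* Pitfall sketch: price is not read from auto, so SAS creates it as a   */
/* new, uninitialized variable and ratio is missing on every record.     */
data auto_bad;
  set auto (keep = make mpg);
  ratio = price / mpg;
run;

/* Fix sketch: keep every variable the step refers to on input, and drop */
/* the ones that are no longer needed on output.                         */
data auto_ok (drop = price);
  set auto (keep = make mpg price);
  ratio = price / mpg;
run;
```

In the first step the log should carry a note that price is uninitialized, which is usually the first sign of this mistake.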

For more information

For information on making SAS data files from raw data, see Inputting Data into SAS.

For more basic information about subsetting variables or subsetting cases in SAS, see the SAS Learning Module on Subsetting data in SAS.

For information about making permanent SAS data files, see Reading and writing SAS system files.

For more advanced issues in subsetting, data transformations and data manipulation see Data transformations and manipulation in SAS in the SAS Library- Web Page Resources.

SAS FAQ How do I check that the same data input by two people are consistently entered?

When two people enter the same data (double data entry), a concern is whether discrepancies exist between the two datasets (the rationale of double data entry), and if so, where. We start by reading in the two datasets, one entered by person1 and the second by person2.

data person1;
  input id name $ age ht wt income;
  datalines;
11 john 23 68 145 23000
12 charlie 25 72 178 45000
13 sally 21 64 135 12000
4 mike 34 70 156 5600
43 paul 30 73 189 15600
;
run;

data person2;
  input id name $ age ht wt income;
  datalines;
11 john 23.5 68 145 23000
12 charles 25 52 178 45000
13 sally 21 64 . 12000
4 michael 34 70 156 5600
43 Paul 30 73 189 5600
;
run;

We start by sorting the two datasets by the id variable, id, and then use the compare procedure to see if any discrepancies exist between the two datasets.

proc sort data = person1;
  by id;
run;

proc sort data = person2;
  by id;
run;

proc compare base = person1 compare = person2 novalues;
run;

The COMPARE Procedure
Comparison of WORK.PERSON1 with WORK.PERSON2
(Method=EXACT)

Data Set Summary

Dataset         Created             Modified            NVar    NObs
WORK.PERSON1    18JAN06:09:01:28    18JAN06:09:01:28       6       5
WORK.PERSON2    18JAN06:09:01:28    18JAN06:09:01:28       6       5

Variables Summary

Number of Variables in Common: 6.

Observation Summary

Observation      Base    Compare
First Obs           1          1
First Unequal       1          1
Last Unequal        5          5
Last Obs            5          5

Number of Observations in Common: 5.
Total Number of Observations Read from WORK.PERSON1: 5.
Total Number of Observations Read from WORK.PERSON2: 5.

Number of Observations with Some Compared Variables Unequal: 5.
Number of Observations with All Compared Variables Equal: 0.

Values Comparison Summary

Number of Variables Compared with All Observations Equal: 1.
Number of Variables Compared with Some Observations Unequal: 5.
Number of Variables with Missing Value Differences: 1.
Total Number of Values which Compare Unequal: 7.
Maximum Difference: 10000.

Variables with Unequal Values

Variable    Type    Len    Ndif    MaxDif    MissDif
name        CHAR      8       3                    0
age         NUM       8       1     0.500          0
ht          NUM       8       1    20.000          0
wt          NUM       8       1         0          1
income      NUM       8       1     10000          0

The basic compare procedure revealed that differences do exist. We now want to find the discrepancies by id. We use the by statement to give the discrepancies by observations; if we didn't have that statement, discrepancies would have been given by the variables. This statement makes it convenient to correct the errors on a case-by-case basis.

proc compare base = person1 compare = person2 brief;
  by id;
  id id;
run;

The COMPARE Procedure
Comparison of WORK.PERSON1 with WORK.PERSON2
(Method=EXACT)

id=4
NOTE: Values of the following 1 variables compare unequal: name

Value Comparison Results for Variables
_________________________________________________________
           ||  Base Value    Compare Value
       id  ||  name          name
  _______  ||  ________      ________
           ||
        4  ||  mike          michael
_________________________________________________________

id=11
NOTE: Values of the following 1 variables compare unequal: age

Value Comparison Results for Variables
_________________________________________________________
           ||       Base    Compare
       id  ||        age        age      Diff.     % Diff
  _______  ||  _________  _________  _________  _________
           ||
       11  ||    23.0000    23.5000     0.5000     2.1739
_________________________________________________________

id=12
NOTE: Values of the following 2 variables compare unequal: name ht

Value Comparison Results for Variables
_________________________________________________________
           ||  Base Value    Compare Value
       id  ||  name          name
  _______  ||  ________      ________
           ||
       12  ||  charlie       charles
_________________________________________________________
_________________________________________________________
           ||       Base    Compare
       id  ||         ht         ht      Diff.     % Diff
  _______  ||  _________  _________  _________  _________
           ||
       12  ||    72.0000    52.0000   -20.0000   -27.7778
_________________________________________________________

id=13
NOTE: Values of the following 1 variables compare unequal: wt

Value Comparison Results for Variables
_________________________________________________________
           ||       Base    Compare
       id  ||         wt         wt      Diff.     % Diff
  _______  ||  _________  _________  _________  _________
           ||
       13  ||   135.0000          .          .          .
_________________________________________________________

id=43
NOTE: Values of the following 2 variables compare unequal: name income

Value Comparison Results for Variables
_________________________________________________________
           ||  Base Value    Compare Value
       id  ||  name          name
  _______  ||  ________      ________
           ||
       43  ||  paul          Paul
_________________________________________________________
_________________________________________________________
           ||       Base    Compare
       id  ||     income     income      Diff.     % Diff
  _______  ||  _________  _________  _________  _________
           ||
       43  ||      15600       5600     -10000   -64.1026
_________________________________________________________

Note from the last case, id = 43, that the procedure is case sensitive for character variables: paul and Paul are reported as unequal.
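If differences in letter case should not count as discrepancies, one option is to put the names in a common case before comparing. The sketch below is only one way to do this; the data set names person1u and person2u are made up, and it assumes person1 and person2 are still sorted by id as above.

```sas
/* Sketch: upper-case name in both files so paul and Paul compare equal. */
/* person1u and person2u are hypothetical names for the working copies.  */
data person1u;
  set person1;
  name = upcase(name);
run;

data person2u;
  set person2;
  name = upcase(name);
run;

proc compare base = person1u compare = person2u brief;
  id id;
run;
```

The original files are left untouched, so the case-sensitive comparison can still be run on them if needed.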

SAS FAQ How do I create a format out of a string variable?

Sometimes two variables in a dataset convey the same information, except that one is a numeric variable and the other is a string variable. For example, in the data set below, we have a numeric variable a coded 1/0 for gender and a string variable b, also for gender, but with more explicit information. It is easy to use the numeric variable, but we may also want to keep the information given by the string variable. This is a case where we want to create value labels for the numeric variable based on the string variable. In SAS, we will create a format from the string variable and apply the format to the numeric variable.

Example 1: A simple example

We have a tiny data set containing the two variables a and b and two observations.

data test;
  input a b $;
datalines;
1 female
0 male
;
run;

Clearly we want to create a format for variable a so that 1 = female and 0 = male. It is easy to create a format simply using the procedure format. For example, we can do the following. The fmtlib option on the second proc format is what prints the contents of the selected format.

proc format;
  value gender 1 = "female"
               0 = "male";
run;
proc format fmtlib;
  select gender;
run;

----------------------------------------------------------------------------
|       FORMAT NAME: GENDER   LENGTH:    6   NUMBER OF VALUES:    2        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH   6  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   20MAY2004:14:25:17)|
|----------------+----------------+----------------------------------------|
|               0|               0|male                                    |
|               1|               1|female                                  |
----------------------------------------------------------------------------

We can also do the following using a data step. This approach does not depend on the number of categories of the string variable; the code will be exactly the same. This is definitely easier when the number of categories is large.

data fmt_dataset;
  retain fmtname "lgender";
  set test;
  start = a;
  label = b;
run;
proc format cntlin = fmt_dataset fmtlib;
  select lgender;
run;

----------------------------------------------------------------------------
|       FORMAT NAME: LGENDER  LENGTH:    6   NUMBER OF VALUES:    2        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH   6  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   20MAY2004:14:01:06)|
|----------------+----------------+----------------------------------------|
|               0|               0|male                                    |
|               1|               1|female                                  |
----------------------------------------------------------------------------
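Once the format exists, it can be attached to the numeric variable. The short sketch below assumes the lgender format and the test data set created above, and simply prints a with its labels.

```sas
/* Sketch: display the numeric variable a using the lgender format. */
proc print data = test;
  format a lgender.;
run;
```

The stored values of a remain 1 and 0; only the displayed values change to female and male.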

Example 2: Another simple (but not so simple) example

We have a dataset called test2 and it looks like the following. There are many repeated rows in the dataset. If we apply the same approach as in the previous example, SAS will yield an error message saying that the range is repeated, or that values overlap. So we need to extract a smaller dataset with no repeats in it.

data test2;
  input group variable $;
  datalines;
0 group1
0 group1
0 group1
0 group1
1 group2
1 group2
1 group2
1 group2
2 group3
2 group3
2 group3
2 group3
3 group4
3 group4
3 group4
3 group4
;
run;

The easiest way of creating a dataset without repeats is to use proc sql.

proc sql;
  create table tofmt as
  select distinct group, variable
  from test2;
quit;
proc print data = tofmt;
run;

Obs    group    variable
 1       0      group1
 2       1      group2
 3       2      group3
 4       3      group4
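For readers who prefer not to use proc sql, a sort with the nodupkey option is another way to drop the repeats. This is a sketch of an alternative, not the method used in this FAQ; the output name tofmt2 is made up so that it does not overwrite tofmt.

```sas
/* Sketch: keep one row per (group, variable) pair using proc sort. */
/* tofmt2 is a hypothetical name for the de-duplicated copy.        */
proc sort data = test2 out = tofmt2 nodupkey;
  by group variable;
run;

proc print data = tofmt2;
run;
```

Because out= is specified, test2 itself is left unchanged.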

Now we are ready to create the format out of the dataset tofmt.

data fmt_dataset;
  retain fmtname "cvar";
  set tofmt;
  start = group;
  label = variable;
run;
proc format cntlin = fmt_dataset fmtlib;
  select cvar;
run;

----------------------------------------------------------------------------
|       FORMAT NAME: CVAR     LENGTH:    6   NUMBER OF VALUES:    4        |
|   MIN LENGTH:   1  MAX LENGTH:  40  DEFAULT LENGTH   6  FUZZ: STD        |
|--------------------------------------------------------------------------|
|START           |END             |LABEL  (VER. V7|V8   09JUN2008:16:23:21)|
|----------------+----------------+----------------------------------------|
|               0|               0|group1                                  |
|               1|               1|group2                                  |
|               2|               2|group3                                  |
|               3|               3|group4                                  |
----------------------------------------------------------------------------

proc print data = test2;
  format group cvar.;
run;

Obs    group     variable
  1    group1    group1
  2    group1    group1
  3    group1    group1
  4    group1    group1
  5    group2    group2
  6    group2    group2
  7    group2    group2
  8    group2    group2
  9    group3    group3
 10    group3    group3
 11    group3    group3
 12    group3    group3
 13    group4    group4
 14    group4    group4
 15    group4    group4
 16    group4    group4

Reference: Creating a format from a data set from SAS online documentation

SAS FAQ How do I display information for all the SAS datasets in a directory?

Let's say that we have a number of SAS data files in a directory and we need to know the number of observations and the number of variables in each data set. Of course, we can always use proc contents on each of the data sets, but that gets tedious and the output quickly becomes too long.

There is an easy solution by using the SAS data file sashelp.vtable that SAS creates and updates during an active SAS session.

Here is an example. Let's say we have a directory called c:\data\dissertation and it contains many SAS files. Here is the SAS code to display all the SAS files in the directory with information on the number of observations and the number of variables.

libname dis 'c:\data\dissertation';
proc print data = sashelp.vtable (where = (libname="DIS")) noobs;
  var memname nobs nvar;
run;

memname          nobs    nvar
MEDICATION_PP    1242      11
META20             20      10
METARESP          105      14
MISFLAT           831       7
MONKEYS           123       7
MULTRESP          134      12
NHIS_SMALL      30663       7
OPPOSITES_PP      140       6
PEETCOMP          187       9
PEETMIS           269       9
......

You can find more information on the data set sashelp.vtable and other v* data sets located in the SASHELP library from this SAS page: The SASHELP Library: It Really Does Help You Manage Data
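The same information is also available to proc sql through the dictionary.tables table, which is what sashelp.vtable is a view of. The sketch below assumes the dis libname defined above; the memtype condition is an addition that limits the listing to data sets.

```sas
/* Sketch: list data sets in the DIS library with their sizes via proc sql. */
libname dis 'c:\data\dissertation';
proc sql;
  select memname, nobs, nvar
  from dictionary.tables
  where libname = 'DIS' and memtype = 'DATA';
quit;
```

Note that the library name is stored in upper case, so the comparison value must be 'DIS' rather than 'dis'.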

SAS FAQ How do I make unique anonymous ID variables for my data?

Suppose you had a file with 25 observations, with a variable called id identifying each observation and some information about each observation (here we just have age).

DATA orig;
  INPUT id age;
CARDS;
 1  3
 2 32
 3 13
 4 16
 5  4
 6  9
 7 43
 8 29
 9 43
10 47
11 13
12  6
13 43
14 48
15 34
16 13
17 47
18  6
19 34
20 42
21 47
22 49
23 28
24 25
25 39
;
RUN;

Suppose you want to make a new id variable called newid that is unique for all observations but conceals the identity of each observation. The strategy for this is as follows.

1. Create a new data file with IDs in it (we will call this newids). Make more IDs than necessary because there may be duplicate IDs.

2. Eliminate any records with duplicate newid in the newids data file.

3. Scramble the order of the newids file (so the order of newid does not give away the person's identity).

4. Merge newids with the original data file (orig), and get rid of the old id variable.

5. During the merge in step 4, make a file called crossref that shows the correspondence between id and newid.

6. Store crossref in a safe place since that file can be used with orig2 to determine the identity of the observations.

1. Here we make newid, which is the new random ID, and we make ranord, which will be used for scrambling the data file.

data NEWIDS;
  do NOBS = 1 to 40;            /* we make up 40 observations in case of duplicates */
    newid = "     ";            /* newid will be 5 characters wide */
    do i = 1 to 5;              /* create each character of newid, 1 - 5 */
      /* make random number 0-35 (0-9 become digits, 10-35 become letters) */
      rannum = int(uniform(0)*36);
      /* if it is 0-9, convert it into 0-9, which is byte(48) - byte(57) */
      if (0 <= rannum <= 9) then ranch = byte(rannum + 48);
      /* if it is 10-35, convert it into A-Z, which is byte(65) - byte(90) */
      if (10 <= rannum <= 35) then ranch = byte(rannum + 55);
      /* combine each character of newid */
      substr(newid,i,1) = ranch;
    end;
    /* make ranord */
    ranord = uniform(0);
    output;
  end;
  /* just keep newid and ranord */
  keep newid ranord;
run;

2. Get rid of any duplicates in newids. The nodupkey option keeps only one record for each value of newid, even when the ranord values differ.

PROC SORT DATA=newids NODUPKEY;
  BY newid;
RUN;

3. Scramble the order of newids so the order of the observations does not give away the identity of the observations.

PROC SORT DATA=newids;
  BY ranord;
RUN;

4. Now, merge orig with newids. If id is missing, that means we have matched all orig observations with newids and this is a newids record without an orig record, so we should delete the observation. For orig2, drop id and ranord so the identity is now anonymous.

5. For crossref, keep id and newid so the identity can be looked up by you if you need to. Keep crossref in a safe, secret place.

DATA orig2(DROP=id ranord) crossref(KEEP=id newid);
  MERGE orig newids;
  IF (id = .) THEN DELETE;
run;

Show new version of original data file with newid.

PROC PRINT DATA=orig2(obs=10);
RUN;

OBS    AGE    NEWID
  1      3    QMB02
  2     32    1QXCR
  3     13    VO5FC
  4     16    4C63M
  5      4    2QQR8
  6      9    VT4O5
  7     43    W9IFN
  8     29    BHPJW
  9     43    B0LJQ
 10     47    QN0CC

Show cross reference file, with id and newid.

PROC PRINT DATA=crossref(obs=10);
RUN;

OBS    ID    NEWID
  1     1    QMB02
  2     2    1QXCR
  3     3    VO5FC
  4     4    4C63M
  5     5    2QQR8
  6     6    VT4O5
  7     7    W9IFN
  8     8    BHPJW
  9     9    B0LJQ
 10    10    QN0CC
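Before relying on the new IDs, it may be worth confirming that every observation received one and that no two observations share one. The check below is a sketch that assumes the crossref file created above; the column names n_obs, n_unique_newid and n_blank_newid are made up for the report.

```sas
/* Sketch: n_obs and n_unique_newid should match, and n_blank_newid */
/* should be 0 if enough unique IDs were generated.                 */
proc sql;
  select count(*)                            as n_obs,
         count(distinct newid)               as n_unique_newid,
         sum(case when newid = ' ' then 1 else 0 end) as n_blank_newid
  from crossref;
quit;
```

A blank newid would indicate that fewer unique IDs survived the duplicate removal than there are observations in orig, in which case more than 40 candidates should be generated.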

SAS FAQ How do I standardize variables in SAS?

To standardize variables in SAS, you can use proc standard. The example shown below creates a data file cars and then uses proc standard to standardize weight and price.

DATA cars;
  INPUT mpg weight price ;
DATALINES;
22 2930 4099
17 3350 4749
22 2640 3799
20 3250 4816
15 4080 7827
;
RUN;

PROC STANDARD DATA=cars MEAN=0 STD=1 OUT=zcars;
  VAR weight price ;
RUN;

PROC MEANS DATA=zcars;
RUN;

The mean=0 and std=1 options are used to tell SAS what you want the mean and standard deviation to be for the variables named on the var statement. Of course, a mean of 0 and standard deviation of 1 indicate that you want to standardize the variables. The out=zcars option states that the output file with the standardized variables will be called zcars.

The proc means on zcars is used to verify that the standardization was performed properly. The output below confirms that the variables have been properly standardized.

Variable  N          Mean       Std Dev       Minimum       Maximum
-------------------------------------------------------------------
MPG       5    19.2000000     3.1144823    15.0000000    22.0000000
WEIGHT    5  -4.44089E-17     1.0000000    -1.1262551     1.5324455
PRICE     5  -4.44089E-17     1.0000000    -0.7835850     1.7233892
-------------------------------------------------------------------

Often you would like to have both the standardized variables and the unstandardized variables in the same data file. The example below shows how you can do that. By making copies of weight and price called zweight and zprice, we can standardize the copies and leave weight and price as the unchanged values.

DATA cars2;
  SET cars;
  zweight = weight;
  zprice = price;
RUN;

PROC STANDARD DATA=cars2 MEAN=0 STD=1 OUT=zcars;
  VAR zweight zprice ;
RUN;

PROC MEANS DATA=zcars;
RUN;

As before, we use proc means to confirm that the variables are properly standardized.

Variable  N          Mean       Std Dev       Minimum       Maximum
-------------------------------------------------------------------
MPG       5    19.2000000     3.1144823    15.0000000    22.0000000
WEIGHT    5       3250.00   541.6179465       2640.00       4080.00
PRICE     5       5058.00       1606.72       3799.00       7827.00
ZWEIGHT   5  -4.44089E-17     1.0000000    -1.1262551     1.5324455
ZPRICE    5  -4.44089E-17     1.0000000    -0.7835850     1.7233892
-------------------------------------------------------------------

As we see in the output above, zweight and zprice have been standardized, and weight and price remain unchanged.
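To see what the standardization is doing, the z-scores can also be computed directly as (value - mean) / standard deviation. The sketch below uses proc sql on the cars file from above; the output name zcars_sql is made up, and this is an illustration of the arithmetic rather than a replacement for proc standard.

```sas
/* Sketch: compute z-scores by hand; mean() and std() are taken over all */
/* rows and merged back onto each row by proc sql.                       */
proc sql;
  create table zcars_sql as
  select mpg, weight, price,
         (weight - mean(weight)) / std(weight) as zweight,
         (price  - mean(price))  / std(price)  as zprice
  from cars;
quit;

proc means data=zcars_sql;
run;
```

The log should contain a note that summary statistics were remerged with the original data, which is expected here.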

SAS FAQ Is there a quick way to create dummy variables?

Converting a categorical variable to dummy variables can be a tedious process when done using a series of if then statements. Consider the following example data file.

DATA auto ;
  LENGTH make $ 20 ;
  /* make is read from columns 1-17, so price must start in column 18 or later */
  INPUT make $ 1-17 price mpg rep78 ;
CARDS;
AMC Concord        4099 22 3
AMC Pacer          4749 17 3
Audi 5000          9690 17 5
Audi Fox           6295 23 3
BMW 320i           9735 25 4
Buick Century      4816 20 3
Buick Electra      7827 15 4
Buick LeSabre      5788 18 3
Cad. Eldorado      14500 14 2
Olds Starfire      4195 24 1
Olds Toronado      10371 16 3
Plym. Volare       4060 18 2
Pont. Catalina     5798 18 4
Pont. Firebird     4934 18 1
Pont. Grand Prix   5222 19 3
Pont. Le Mans      4723 19 3
;
RUN;

The variable rep78 is coded with values from 1 - 5 representing various repair histories. We may create dummy variables for rep78 by writing separate assignment statements for each value as follows:

DATA auto2 ;
  SET auto ;
  IF rep78 = 1 THEN rep78_1 = 1; ELSE rep78_1 = 0;
  IF rep78 = 2 THEN rep78_2 = 1; ELSE rep78_2 = 0;
  IF rep78 = 3 THEN rep78_3 = 1; ELSE rep78_3 = 0;
  IF rep78 = 4 THEN rep78_4 = 1; ELSE rep78_4 = 0;
  IF rep78 = 5 THEN rep78_5 = 1; ELSE rep78_5 = 0;
RUN;

PROC FREQ DATA=auto2;
  TABLES rep78*rep78_1*rep78_2*rep78_3*rep78_4*rep78_5 / list ;
RUN;

As you see from the proc freq below, the dummy variables were properly created, but it required a lot of if then else statements.

[Output below edited for readability]

REP78  REP78_1  REP78_2  REP78_3  REP78_4  REP78_5  Freq  Percent
------------------------------------------------------------------
  1       1        0        0        0        0       2     12.5
  2       0        1        0        0        0       2     12.5
  3       0        0        1        0        0       8     50.0
  4       0        0        0        1        0       3     18.8
  5       0        0        0        0        1       1      6.3

Had rep78 ranged from 1 to 10 or 1 to 20, that would be a lot of typing (and prone to error). Here is a shortcut you could use when you need to create dummy variables.

DATA auto3;
  set auto;
  ARRAY dummys {*} 3. rep78_1 - rep78_5;
  DO i=1 TO 5;
    dummys(i) = 0;
  END;
  dummys( rep78 ) = 1;
RUN;

PROC FREQ DATA=auto3;
  TABLES rep78*rep78_1*rep78_2*rep78_3*rep78_4*rep78_5 / list ;
RUN;

As you see below, the dummy variables were created successfully.

[Output below edited for readability]

REP78  REP78_1  REP78_2  REP78_3  REP78_4  REP78_5  Freq  Percent
------------------------------------------------------------------
  1       1        0        0        0        0       2     12.5
  2       0        1        0        0        0       2     12.5
  3       0        0        1        0        0       8     50.0
  4       0        0        0        1        0       3     18.8
  5       0        0        0        0        1       1      6.3

Let's look at each statement in some detail.

ARRAY dummys {*} 3. rep78_1 - rep78_5;

This statement defines an array called dummys that creates five dummy variables rep78_1 to rep78_5 giving each the minimum storage length required, i.e., 3 bytes. You would change rep78_1 to rep78_5to be the names you want for your dummy variables. The asterisk in the brackets tells SAS to automatically count up the

number of new variables based on the number of variables listed at the end of the statement.

DO i=1 TO 5; dummys(i) = 0;END;

This initialized each dummy variable to 0. You would change 5 to be the number values your variable could have.

dummys(rep78) = 1;

This sets the appropriate dummy variable to 1. For example, if rep78 = 3, then dummys( rep78 ) = 1 will assign a value of 1 to the third element in the array, i.e., assign 1 to rep78_3. You would change rep78 to the name of the variable for which you want to create dummy variables.
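One caveat with this shortcut: if rep78 is missing or falls outside the range 1 to 5, the subscript in dummys(rep78) is out of range and the data step stops with an error. The following is a minimal sketch of one way to guard against that. The data set auto_small and its values are made up for illustration, and setting the dummy variables to missing when rep78 is not usable is an assumption about what you would want.

```sas
/* hypothetical example data; the second observation has a missing rep78 */
data auto_small;
  input make $ rep78;
  datalines;
AMC 3
Audi .
BMW 5
;
run;

data auto_guarded (drop = i);
  set auto_small;
  array dummys {*} 3. rep78_1 - rep78_5;
  do i = 1 to dim(dummys);
    dummys(i) = 0;
  end;
  /* index the array only when rep78 is a valid subscript */
  if 1 <= rep78 <= dim(dummys) then dummys(rep78) = 1;
  /* otherwise set all five dummies to missing instead of leaving zeros */
  else call missing(of dummys(*));
run;

proc print data = auto_guarded;
run;
```

Using dim(dummys) instead of a hard-coded 5 means only the ARRAY statement has to change when the variable has more categories.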

How to Create Dummy Variables in SAS

Say you have an age variable with five values: 1, 2, 3, 4 and 5, and you want to make five dummy variables of it.

DATA TEST;
  INPUT age;
  DATALINES;
1
2
3
4
5
;

Method 1

DATA DUMMYMETHOD1;
  SET TEST;
  age1=(age=1);
  age2=(age=2);
  age3=(age=3);
  age4=(age=4);
  age5=(age=5);
PROC FREQ;
  TABLE age1 age2 age3 age4 age5;
RUN;

Note: The statement age1=(age=1) will create a variable called age1 with a value of 1 if age=1, and 0 otherwise.

Method 2

DATA DUMMYMETHOD2 (DROP = i);
  SET TEST;
  ARRAY A {*} age1 age2 age3 age4 age5;
  DO i = 1 TO 5;
    A(i) = (age=i);
  END;
PROC FREQ;
RUN;

Note: An alternative representation to the ARRAY statement above is:

ARRAY A {*} age1-age5;

This alternative is useful if you are creating many dummy variables (assuming age has more than five unique values).

Method 3

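DATA DUMMYMETHOD3;
  SET TEST;
  IF age=1 then age1=1; ELSE age1 = 0;
  IF age=2 then age2=1; ELSE age2 = 0;
  IF age=3 then age3=1; ELSE age3 = 0;
  IF age=4 then age4=1; ELSE age4 = 0;
  IF age=5 then age5=1; ELSE age5 = 0;
PROC FREQ;
RUN;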

Note: This option is not useful for creating dummy variables out of a variable with many unique data values.

SAS FAQ: How can I access and use SAS ZIP code data files for the United States?

Note that this page was written using SAS 9.1.3 for Windows.

SAS comes with many helpful datasets. One of these is called sashelp.zipcode, which contains information about city names, FIPS codes, ZIP code centroids (latitude and longitude coordinates) and more. This page will take you through the process of accessing and using this data set and the ZIP code related functions.

Finding and updating the SAS ZIP code data file

First let's figure out which version of the zip code data set is stored in the sashelp library.

proc contents data=sashelp.zipcode;
run;

The CONTENTS Procedure

Data Set Name        SASHELP.ZIPCODE                       Observations          41988
Member Type          DATA                                  Variables             16
Engine               V9                                    Indexes               1
Created              Wednesday, June 23, 2004 08:14:49 PM  Observation Length    192
Last Modified        Wednesday, June 23, 2004 08:14:49 PM  Deleted Observations  0
Protection                                                 Compressed            NO
Data Set Type                                              Sorted                NO
Label                zipcodedownload.com April2004, UNIQUE-updated May 2004, Release 9.1.3
Data Representation  WINDOWS_32
Encoding             us-ascii  ASCII (ANSI)

<some output omitted>

Alphabetic List of Variables and Attributes

 #  Variable   Type  Len  Format  Informat  Label
12  AREACODE   Num     8                    Area Code for ZIP Code. None for APO/FPO
 5  CITY       Char   35                    Name of city/org
 9  COUNTY     Num     8                    FIPS county code. Blank for APO/FPO
10  COUNTYNM   Char   25                    Name of county/parish. No county for APO/FPO
15  DST        Char    1                    ZIP Code obeys Daylight Savings: Y-Yes N-No
14  GMTOFFSET  Num     8                    Diff (hrs) between GMT and time zone for ZIP Code
11  MSA        Num     8                    Metro Service Area code by common pop; no MSA for rural
16  PONAME     Char   28  $28.    $28.      USPS Post Office Name
 6  STATE      Num     8                    Two-digit number (FIPS code) for state/territory
 7  STATECODE  Char    2                    Two-letter abbrev. for state name.
 8  STATENAME  Char   25                    Full name of state/territory
13  TIMEZONE   Char    9                    Time Zone for ZIP Code. None for APO/FPO
 3  X          Num     8  11.6              Longitude (degrees) of the center (centroid) of ZIP Code. 0.0 for APO/FPO
 2  Y          Num     8  11.6              Latitude (degrees) of the center (centroid) of ZIP Code. 0.0 for APO/FPO
 1  ZIP        Num     8  Z5.               The 5-digit ZIP Code
 4  ZIP_CLASS  Char    1                    ZIP Code Classification: M=APO/FPO P=PO Box U=Unique zip used for large orgs/businesses/bldgs Blank=Standard/non-unique

From the output of proc contents, we can tell that this version of zipcode data set was created in June of 2004, and it has 16 variables and 41,988 observations. Since SAS updates the zipcode dataset on a regular basis, it is very likely that our version is very out of date. Updates can be found here. (At the time this page was created the latest version was created in October 2007.)

Clearly the version on file is out of date and needs to be updated. Follow the link above to the SAS page. You will need a SAS profile in order to download the current version. Once you have downloaded and unzipped the new file, you will need to use proc cimport to import the new data file. The file= option tells SAS where the file you want to import is stored. The library= option tells SAS which library the new file should be read into. After running the program, check the log to see what proc cimport did, and first make sure that there are no errors. Here the file was imported successfully, and the log shows that zipcode_oct07.cpt actually contains three data sets. The new data set that contains unique ZIP codes, zipcode_07Q4_unique, has 18 variables and 41,759 observations.

proc cimport file = 'C:\data\zipcode_oct07\zipcode_oct07.cpt' library = work;
run;

NOTE: Proc CIMPORT begins to create/update data set WORK.ZIPCODE_07Q4_UNIQUE
NOTE: The data set index ZIP is defined.
NOTE: Data set contains 18 variables and 41759 observations. Logical record length is 512

NOTE: Proc CIMPORT begins to create/update data set WORK.ZIPMIL_07Q4
NOTE: The data set index ZIP is defined.
NOTE: Data set contains 18 variables and 539 observations. Logical record length is 512

NOTE: Proc CIMPORT begins to create/update data set WORK.ZIPMISC_07Q4
NOTE: The data set index ZIP is defined.
NOTE: Data set contains 17 variables and 19 observations. Logical record length is 216

Now that the new file has been imported correctly, let's save a permanent copy of the old file before it is replaced. First we use the libname statement to define a permanent library where the old data files can be stored. Then we use a data step to place a copy of the old data file into the permanent library. Look at the log file to make sure it was copied successfully.

libname zip 'C:\data';
data zip.march2004;
  set sashelp.zipcode;
run;

2098 libname zip 'C:\data';
NOTE: Libref ZIP was successfully assigned as follows:
      Engine: V9
      Physical Name: C:\data

2099 data zip.march2004;
2100 set sashelp.zipcode;
2101 run;

NOTE: There were 41988 observations read from the data set SASHELP.ZIPCODE.
NOTE: The data set ZIP.MARCH2004 has 41988 observations and 16 variables.

Now that you have a backup copy of the original file saved in a different location you can replace it with the newer version. Make sure to check the log output to ensure that everything was copied successfully. As seen below the file sashelp.zipcode now contains 41,759 observations and 18 variables and has been successfully replaced. Remember if there are any problems replacing the file the original is still saved in a permanent library and you can start over.

data sashelp.zipcode;
  set work.zipcode_07Q4_unique;
run;

2285 data sashelp.zipcode;
2286 set work.zipcode_07Q4_unique;
2287 run;

NOTE: There were 41759 observations read from the data set WORK.ZIPCODE_07Q4_UNIQUE.
NOTE: The data set SASHELP.ZIPCODE has 41759 observations and 18 variables.

Using the SAS ZIP code data file

Let's take a look at the first ten observations in the data set to get a better understanding of what it contains.

proc print data = sashelp.zipcode (obs=10);
run;

Obs  ZIP    Y          X           ZIP_CLASS  CITY        STATE  STATECODE  STATENAME    COUNTY
  1  00501  40.813078  -73.046388  U          Holtsville     36  NY         New York        103
  2  00544  40.813223  -73.049288  U          Holtsville     36  NY         New York        103
  3  00601  18.165950  -66.723627             Adjuntas       72  PR         Puerto Rico       1
  4  00602  18.383005  -67.186553             Aguada         72  PR         Puerto Rico       3
  5  00603  18.433236  -67.151954             Aguadilla      72  PR         Puerto Rico       5
  6  00604  18.505289  -67.135899  P          Aguadilla      72  PR         Puerto Rico       5
  7  00605  18.436149  -67.151346  P          Aguadilla      72  PR         Puerto Rico       5
  8  00606  18.185616  -66.977377             Maricao        72  PR         Puerto Rico      93
  9  00610  18.285948  -67.144161             Anasco         72  PR         Puerto Rico      11
 10  00611  18.287716  -66.797578  P          Angeles        72  PR         Puerto Rico     141

Obs  COUNTYNM   MSA   AREACODE  TIMEZONE  GMTOFFSET  DST  PONAME
  1  Suffolk    5380       631  Eastern          -5  Y    Holtsville
  2  Suffolk    5380       631  Eastern          -5  Y    Holtsville
  3  Adjuntas      0       787  Atlantic         -4  N    Adjuntas
  4  Aguada       60       787  Atlantic         -4  N    Aguada
  5  Aguadilla    60       787  Atlantic         -4  N    Aguadilla
  6  Aguadilla    60       787  Atlantic         -4  N    Aguadilla
  7  Aguadilla    60       787  Atlantic         -4  N    Aguadilla
  8  Maricao       0       787  Atlantic         -4  N    Maricao
  9  Anasco     4840       787  Atlantic         -4  N    Anasco
 10  Utuado        0       787  Atlantic         -4  N    Angeles

Besides the data file sashelp.zipcode that SAS offers in its sashelp library, SAS also offers a few ZIP code related functions. The functions listed below take a single ZIP code or a variable containing ZIP codes as an argument and return related information. Note that individual ZIP codes can be used with or without quotation marks for all the functions. However, when using a variable as the argument, only the ZIPCITY function will accept a string variable. All of the functions accept numeric variables. Additional information, including county, area code, latitude and longitude, is contained in the dataset sashelp.zipcode, as we have seen from the proc print above and from the output of proc contents.

ZIPCITY('90024') will return "Los Angeles, CA"
ZIPNAME(90024) will return "CALIFORNIA"
ZIPNAMEL('90024') will return "California"
ZIPSTATE('90024') will return "CA"
ZIPFIPS('90024') will return 6

The following is a short example showing how to use the ZIP functions. First create the data set zipcode. Next create a new data set called info that defines new variables with city, state and FIPS information. Finally print the new data set to make sure we have the correct results.

data zipcode;
  input Zipcode;
  cards;
90024
02139
64640
92692
60640
;
run;

data info;
  set zipcode;
  CityState = zipcity(zipcode);
  StateName = zipname(zipcode);
  StateNameL = zipnamel(zipcode);
  State = zipstate(zipcode);
  FIPS = zipfips(zipcode);
run;

proc print data = info;
run;

Obs  Zipcode  CityState          StateName      StateNameL     State  FIPS
  1    90024  Los Angeles, CA    CALIFORNIA     California     CA        6
  2     2139  Cambridge, MA      MASSACHUSETTS  Massachusetts  MA       25
  3    64640  Gallatin, MO       MISSOURI       Missouri       MO       29
  4    92692  Mission Viejo, CA  CALIFORNIA     California     CA        6
  5    60640  Chicago, IL        ILLINOIS       Illinois       IL       17
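The functions return only city and state information, so county, area code and centroid coordinates have to be looked up in sashelp.zipcode itself. The following is a minimal sketch of such a lookup with a join. It assumes the variable names shown in the proc print output above (ZIP, CITY, STATECODE, COUNTYNM, AREACODE, X, Y); the data set names my_zips and info_geo are made up for illustration.

```sas
/* hypothetical list of ZIP codes to look up */
data my_zips;
  input zipcode;
  format zipcode z5.;   /* z5. keeps the leading zero of 02139 when printing */
  datalines;
90024
02139
60640
;
run;

proc sql;
  create table info_geo as
  select a.zipcode,
         b.city,
         b.statecode,
         b.countynm,
         b.areacode,
         b.x as longitude,
         b.y as latitude
  from my_zips as a
       left join sashelp.zipcode as b
       on a.zipcode = b.zip
  order by a.zipcode;
quit;

proc print data = info_geo;
run;
```

A left join keeps every ZIP code from my_zips, so a code that is not found in sashelp.zipcode shows up with missing values instead of silently disappearing.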

Other resources

Louise Hadden and Mike Zdeb. "ZIP Code 411: A Well-Kept SAS® Secret", SUGI 31.

SAS/GIS 9.1 Spatial Data and Procedure Guide

SAS FAQ: How can I recode my ID variable to be short and numeric?

Sometimes your dataset includes an identifying variable that is unnecessarily long and uninformative. For example, your ID variable may be a string of length 12 with both letters and numbers (e.g., "77A34987BG34"). You may wish to create a new identifying variable that simply maps the complicated ID variable onto integers, starting at 1 and going up to the number of unique IDs that appear in your dataset. The code below provides an example of how to do this.

data test;
  input id a b;
  cards;
9385793487 0 0
3598437987 1 0
5987398759 1 0
9593859853 0 1
5987398759 0 0
9385793487 0 0
3598437987 0 1
7892343344 1 1
;

proc print data = test;
run;

Obs          id  a  b
  1  9385793487  0  0
  2  3598437987  1  0
  3  5987398759  1  0
  4  9593859853  0  1
  5  5987398759  0  0
  6  9385793487  0  0
  7  3598437987  0  1
  8  7892343344  1  1

proc sort data = test;
  by id;
run;

data test2;
  set test;
  by id;
  retain newid 0;
  if first.id then newid = newid + 1;
run;

proc print data = test2;
run;

Obs          id  a  b  newid
  1  3598437987  1  0      1
  2  3598437987  0  1      1
  3  5987398759  1  0      2
  4  5987398759  0  0      2
  5  7892343344  1  1      3
  6  9385793487  0  0      4
  7  9385793487  0  0      4
  8  9593859853  0  1      5

Now our dataset has a short and informative identifying variable.

SAS FAQ: How can I increment dates in SAS?

The intnx function increments dates by intervals. It computes the date (or datetime) of the start of each interval. For example, let's suppose that you had a column of days of the month, and you wanted to create a new variable that was the first of the next month. You could use the intnx function to help you create your new variable.

The syntax of the intnx function is: intnx(interval, from, n <, alignment>), where interval is a character (e.g., string) constant or variable, from is the starting value (either a date or datetime), n is the number of intervals to increment, and alignment is optional and controls the alignment of the dates.

data temp2;
  input id 1 @3 date mmddyy11.;
  cards;
1 11/12/1980
2 10/20/1996
3 12/21/1999
;
run;

proc print data = temp2;
  format date date9.;
run;

id       date
 1  12NOV1980
 2  20OCT1996
 3  21DEC1999

data temp3;
  set temp2;
  new_month = intnx('month',date,1);
run;

proc print data = temp3 noobs;
  format date new_month date9.;
run;

id       date  new_month
 1  12NOV1980  01DEC1980
 2  20OCT1996  01NOV1996
 3  21DEC1999  01JAN2000

Now let's try another example, this time creating a variable that is two days later than the day given in our data set.

data temp3a;
  set temp2;
  two_days = intnx('day',date,2);
run;

proc print data = temp3a noobs;
  format date two_days date9.;
run;

id       date   two_days
 1  12NOV1980  14NOV1980
 2  20OCT1996  22OCT1996
 3  21DEC1999  23DEC1999
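The optional alignment argument mentioned in the syntax above is what changes the "start of the interval" behavior. The following is a small sketch, assuming a SAS 9 release in which the 'same' alignment value is available; the data set and variable names are made up for illustration.

```sas
data dates;
  input date :mmddyy10.;
  /* default alignment: first day of the next month */
  start_next = intnx('month', date, 1);
  /* 'end' alignment: last day of the next month */
  end_next   = intnx('month', date, 1, 'end');
  /* 'same' alignment: same day of the month, one month later */
  same_next  = intnx('month', date, 1, 'same');
  format date start_next end_next same_next date9.;
  datalines;
11/12/1980
10/20/1996
12/21/1999
;
run;

proc print data = dates noobs;
run;
```

For 12NOV1980, these three calls should give 01DEC1980, 31DEC1980 and 12DEC1980 respectively.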

SAS FAQ: How do I create time series variables using proc expand?

Proc expand is a very useful procedure for working with time series data in terms of creating time series variables and plotting trends. We are going to show some examples here using the data set sp500.sas7bdat. In particular, we are going to use only the variables date and open (price). Here are the first ten observations of these two variables.

Obs        DATE     OPEN
  1  01/04/2001  1347.56
  2  01/05/2001  1333.34
  3  01/08/2001  1298.35
  4  01/09/2001  1295.86
  5  01/10/2001  1300.80
  6  01/11/2001  1313.27
  7  01/12/2001  1326.82
  8  01/16/2001  1318.32
  9  01/17/2001  1326.65
 10  01/18/2001  1329.89

Example 1. Creating a moving average variable

Let's say that we need to create a moving average of the variable open with a window size of 5. It can be done easily using the convert statement. Here we specify a new variable name, open_ma, for the moving average, and in the option transformout we specify that the transformation is a moving average (movave) with a window size of 5. We also use "out = ma" to name the data set that includes the new variable open_ma. If the option "out = " is missing, SAS will create the new data set anyway and will name it data1 or similar.

proc sort data = sp500;
  by date;
run;

*generating moving average variable;
proc expand data = sp500 out = ma;
  convert open = open_ma / transformout=( movave 5);
run;

proc print data = ma (obs=10);
  var date open open_ma;
run;

Obs        DATE     OPEN  open_ma
  1  01/04/2001  1347.56  1347.56
  2  01/05/2001  1333.34  1340.45
  3  01/08/2001  1298.35  1326.42
  4  01/09/2001  1295.86  1318.78
  5  01/10/2001  1300.80  1315.18
  6  01/11/2001  1313.27  1308.32
  7  01/12/2001  1326.82  1307.02
  8  01/16/2001  1318.32  1311.01
  9  01/17/2001  1326.65  1317.17
 10  01/18/2001  1329.89  1322.99
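As the output shows, movave is a backward-looking average: the value for observation 5 is the mean of observations 1 through 5, and the first four values are averages over a shorter window. If you want the window centered on the current observation instead, the cmovave operation is one option. The following is a minimal sketch on the ten observations listed above; the data set names sp500_small and ma_centered and the variable name open_cma are made up for illustration.

```sas
/* the first ten observations shown above, typed in as a small test data set */
data sp500_small;
  input date :mmddyy10. open;
  format date mmddyy10.;
  datalines;
01/04/2001 1347.56
01/05/2001 1333.34
01/08/2001 1298.35
01/09/2001 1295.86
01/10/2001 1300.80
01/11/2001 1313.27
01/12/2001 1326.82
01/16/2001 1318.32
01/17/2001 1326.65
01/18/2001 1329.89
;
run;

/* method=none: transform the observed values only, no interpolation */
proc expand data = sp500_small out = ma_centered method = none;
  convert open = open_ma  / transformout = (movave 5);   /* trailing window */
  convert open = open_cma / transformout = (cmovave 5);  /* centered window */
run;

proc print data = ma_centered;
  var date open open_ma open_cma;
run;
```

Printing both versions side by side makes it easy to see how the centered average differs from the trailing one.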

Example 2. Creating lag and lead variables

There are many transformations that SAS offers and you can view the list and more details following the link. Here is an example of creating a lag variable and a lead variable.

proc expand data = sp500 out=ma1 method=none;
  convert open = open_lag1 / transformout = (lag 1);
  convert open = open_lead4 / transformout = (lead 4);
run;

proc print data = ma1 (obs=10);
  var date open open_lag1 open_lead4;
run;

Obs        DATE     OPEN  open_lag1  open_lead4
  1  01/04/2001  1347.56          .     1300.80
  2  01/05/2001  1333.34    1347.56     1313.27
  3  01/08/2001  1298.35    1333.34     1326.82
  4  01/09/2001  1295.86    1298.35     1318.32
  5  01/10/2001  1300.80    1295.86     1326.65
  6  01/11/2001  1313.27    1300.80     1329.89
  7  01/12/2001  1326.82    1313.27     1347.97
  8  01/16/2001  1318.32    1326.82     1342.54
  9  01/17/2001  1326.65    1318.32     1342.90
 10  01/18/2001  1329.89    1326.65     1360.40

proc print data = ma1 (firstobs=236);
  var date open open_lag1 open_lead4;
run;

Obs        DATE     OPEN  open_lag1  open_lead4
236  12/14/2001  1119.38    1137.07     1149.56
237  12/17/2001  1123.09    1119.38     1139.93
238  12/18/2001  1134.36    1123.09     1144.89
239  12/19/2001  1142.92    1134.36     1144.65
240  12/20/2001  1149.56    1142.92     1149.37
241  12/21/2001  1139.93    1149.56     1157.13
242  12/24/2001  1144.89    1139.93     1161.02
243  12/26/2001  1144.65    1144.89           .
244  12/27/2001  1149.37    1144.65           .
245  12/28/2001  1157.13    1149.37           .
246  12/31/2001  1161.02    1157.13           .

Example 3. Filling-in the gaps

Many times a time series has gaps between two time points. For example, the first ten observations of our example data set go from 01/04/2001 to 01/18/2001, with no observations on weekends and holidays. Proc expand offers many different methods for filling in the gaps. In this example, we will use the "method=step" option to fill the gaps with the most recent input value.

proc expand data = in.sp500 out=daily to=day method=step;
  convert open = daily_open;
  id date;
run;

proc print data = daily (obs=20) noobs;
run;

     DATE  daily_open
04JAN2001     1347.56
05JAN2001     1333.34
06JAN2001     1333.34
07JAN2001     1333.34
08JAN2001     1298.35
09JAN2001     1295.86
10JAN2001     1300.80
11JAN2001     1313.27
12JAN2001     1326.82
13JAN2001     1326.82
14JAN2001     1326.82
15JAN2001     1326.82
16JAN2001     1318.32
17JAN2001     1326.65
18JAN2001     1329.89
19JAN2001     1347.97
20JAN2001     1347.97
21JAN2001     1347.97
22JAN2001     1342.54
23JAN2001     1342.90

SAS FAQ: How can I create lag and lead variables in longitudinal data?

When looking at data across consistent units of time (years, quarters, months), there is often interest in creating variables based on how data for a given time period compares to the periods before and after. If you have longitudinal data, you wish to look across units of time within a single subject. When your data is in long form (one observation per time point per subject), this can easily be handled in Stata with standard variable creation steps because of the way in which Stata processes datasets: it stores the entire dataset and can easily refer to any point in the dataset when generating variables. SAS works differently. SAS variables are typically created through a data step in which SAS moves through the dataset, observation by observation, carrying out the calculations for the given observation and accessing only one observation at a time. This system of data storage and access makes it possible for SAS to analyze large datasets but also very difficult to create time series variables in SAS using a data step. However, proc expand provides an easy-to-use alternative to the data step.

Let's start with an example dataset containing only one subject. The dataset below contains US unemployment rates from September, 2006 to August, 2008.

data unemp;
  input year month rate @@;
  date = mdy( month, 1 , year );
  format date yymm.;
  datalines;
2006 09 4.5 2006 10 4.4
2006 11 4.5 2006 12 4.4
2007 01 4.6 2007 02 4.5
2007 03 4.4 2007 04 4.5
2007 05 4.5 2007 06 4.6
2007 07 4.7 2007 08 4.7
2007 09 4.7 2007 10 4.8
2007 11 4.7 2007 12 5
2008 01 4.9 2008 02 4.8
2008 03 5.1 2008 04 5
2008 05 5.5 2008 06 5.5
2008 07 5.7 2008 08 6.1
;

proc print data = unemp (obs = 5);
run;

Obs  year  month  rate     date
  1  2006      9   4.5  2006M09
  2  2006     10   4.4  2006M10
  3  2006     11   4.5  2006M11
  4  2006     12   4.4  2006M12
  5  2007      1   4.6  2007M01

For each month, we wish to know the difference between its rate and the rate of the previous month (r(i) - r(i-1)), the difference between the rate of the next month and its rate (r(i+1) - r(i)), and the difference between these two differences ((r(i+1) - r(i)) - (r(i) - r(i-1))). To do this, we will use proc expand to generate a new dataset including the variables we need. In the proc expand statement, we name the new dataset unemp_laglead. With method = none, we indicate that we do not wish to transform the values (using a spline, for example) but simply to grab the untransformed data from the specified record. We indicate that our time series is defined by date in the id statement, and in the three convert statements we create the three values we wish to have for each time point in our data: the rate, the previous rate (rate_lag1), and the next rate (rate_lead1). In each convert statement, we tell SAS the name of the variable in our new dataset, the type of transformation (lag, lead) and the number of time points to look back or ahead for the transformation (1 in this example).

proc expand data=unemp out=unemp_laglead method = none;
  id date;
  convert rate = rate_lag1 / transformout=(lag 1);
  convert rate;
  convert rate = rate_lead1 / transformout=(lead 1);
run;

We can see the resulting dataset.

proc print data = unemp_laglead (obs = 5);
run;

Obs     date  rate_lag1  rate  rate_lead1  year  month
  1  2006M09          .   4.5         4.4  2006      9
  2  2006M10        4.5   4.4         4.5  2006     10
  3  2006M11        4.4   4.5         4.4  2006     11
  4  2006M12        4.5   4.4         4.6  2006     12
  5  2007M01        4.4   4.6         4.5  2007      1

Based on this dataset, we can now easily calculate the three time series variables we described earlier. But what if we had data for multiple countries? The dataset below contains unemployment data from 2000-2005 for three countries.

data unemp_international;
  input country $ year rate @@;
  datalines;
US 2000 4 Canada 2000 6.1 UK 2000 5.5
US 2001 4.7 Canada 2001 6.5 UK 2001 5.1
US 2002 5.8 Canada 2002 7 UK 2002 5.2
US 2003 6 Canada 2003 6.9 UK 2003 5
US 2004 5.5 Canada 2004 6.4 UK 2004 4.8
US 2005 5.1 Canada 2005 6 UK 2005 4.9
;

proc print data = unemp_international (obs = 5);
run;

Obs  country  year  rate
  1  US       2000   4.0
  2  Canada   2000   6.1
  3  UK       2000   5.5
  4  US       2001   4.7
  5  Canada   2001   6.5

We wish to create lag and lead variables within each country. To do this, we can use proc expand with a by statement after sorting on country.

proc sort data = unemp_international;
  by country;
run;

proc expand data=unemp_international out=unemp_int2 method = none;
  by country;
  id year;
  convert rate = rate_lag1 / transformout=(lag 1);
  convert rate;
  convert rate = rate_lead1 / transformout=(lead 1);
run;

proc print data = unemp_int2;
run;

Obs  country  year  rate_lag1  rate  rate_lead1
  1  Canada   2000          .   6.1         6.5
  2  Canada   2001        6.1   6.5         7.0
  3  Canada   2002        6.5   7.0         6.9
  4  Canada   2003        7.0   6.9         6.4
  5  Canada   2004        6.9   6.4         6.0
  6  Canada   2005        6.4   6.0           .
  7  UK       2000          .   5.5         5.1
  8  UK       2001        5.5   5.1         5.2
  9  UK       2002        5.1   5.2         5.0
 10  UK       2003        5.2   5.0         4.8
 11  UK       2004        5.0   4.8         4.9
 12  UK       2005        4.8   4.9           .
 13  US       2000          .   4.0         4.7
 14  US       2001        4.0   4.7         5.8
 15  US       2002        4.7   5.8         6.0
 16  US       2003        5.8   6.0         5.5
 17  US       2004        6.0   5.5         5.1
 18  US       2005        5.5   5.1           .

With proc expand, you can also generate moving averages, splines, and interpolated values. For more details, see the proc expand pages of the SAS Online Documentation.
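Returning to the single-country example, the three differences described there can be computed in an ordinary data step once the lag and lead variables exist. The following is a minimal sketch on a four-month excerpt of the unemployment data; the data set names unemp_small, unemp_ll and unemp_diffs and the three diff_ variable names are made up for illustration.

```sas
/* a short excerpt of the monthly unemployment data used above */
data unemp_small;
  input year month rate;
  date = mdy(month, 1, year);
  format date yymm.;
  datalines;
2006 09 4.5
2006 10 4.4
2006 11 4.5
2006 12 4.4
;
run;

proc expand data = unemp_small out = unemp_ll method = none;
  id date;
  convert rate = rate_lag1  / transformout = (lag 1);
  convert rate;
  convert rate = rate_lead1 / transformout = (lead 1);
run;

data unemp_diffs;
  set unemp_ll;
  diff_prev   = rate - rate_lag1;        /* r(i) - r(i-1)   */
  diff_next   = rate_lead1 - rate;       /* r(i+1) - r(i)   */
  diff_change = diff_next - diff_prev;   /* change in slope */
run;

proc print data = unemp_diffs;
run;
```

The first and last observations have no previous or next rate, so their differences come out missing and SAS writes a note about missing values to the log.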

SAS FAQ: How can I find things in a character variable in SAS?

You can find a specific character, such as a letter, a group of letters, or special characters, by using the index function. For example, suppose that you have a data file with names and other information and you want to identify only those records for people with "Harvey" in their name. You could use the index function as shown below. First, let's input an example data set and use proc print to see that it was entered correctly.

data temp;
  input name $ 1-12 age;
  cards;
Harvey Smith 30
John West    35
Jim Cann     41
James Harvey 32
Harvy Adams  33
;
run;

proc print data = temp;
run;

Obs  name          age
  1  Harvey Smith   30
  2  John West      35
  3  Jim Cann       41
  4  James Harvey   32
  5  Harvy Adams    33

Now, let's use the index function to find the cases with "Harvey" in the name.

data temp1;
  set temp;
  x = index(name, "Harvey");
run;

proc print data = temp1;
run;

Obs  name          age  x
  1  Harvey Smith   30  1
  2  John West      35  0
  3  Jim Cann       41  0
  4  James Harvey   32  7
  5  Harvy Adams    33  0

The values of the variable x tell us the first location in the variable name where SAS encountered the word "Harvey". In the second observation, John West does not have the word "Harvey" in his name, so a value of 0 was returned.
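Because index returns 0 when nothing is found, its result can be used directly to keep only the matching records. The following is a small sketch that also makes the search case-insensitive by upcasing the name first; the data set names people and harveys are made up for illustration.

```sas
data people;
  input name $ 1-12 age;
  datalines;
Harvey Smith 30
John West    35
james harvey 32
;
run;

data harveys;
  set people;
  /* subsetting IF: keep rows where the name contains "harvey" in any letter case */
  if index(upcase(name), "HARVEY") > 0;
run;

proc print data = harveys;
run;
```

With this input, the first and third records are kept and John West is dropped.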

Now let's suppose that you wanted to search for any one of several characters in a string variable. For example, perhaps you want to search for "-", "_" or "X". To accomplish this, you could use the indexc function, which allows you to supply multiple search strings. The variable found1 is included to show why you cannot use the index function and supply it with all of the characters for which you are searching: index looks for the whole string "-_X", not for any one of its characters.

data temp3;
  input string $ 1-11;
  cards;
4-5 abc XxX
11_ jkl xxx
abc 3-5 jjj
xXx ()1 lll
xxx 344 aaa
;
run;

data temp4;
  set temp3;
  found = indexc(string, "-", "_", "X");
  found1 = index(string, "-_X");
run;

proc print data = temp4;
run;

Obs  string       found  found1
  1  4-5 abc XxX      2       0
  2  11_ jkl xxx      3       0
  3  abc 3-5 jjj      6       0
  4  xXx ()1 lll      2       0
  5  xxx 344 aaa      0       0

As you can see from the output above, the value of the variable found is the position at which the first of any of the characters listed in the indexc function was encountered.

SAS FAQ: How can I get rid of extra spaces in a string variable?

Sometimes, a string variable can have many words in it and extra spaces between the words. The easiest way to get rid of the extra spaces is to use SAS function compbl. Here is an example.

data test;
  length address1 $40. address2 $60.;
  /* column input: address1 is in columns 1-20, address2 starts in column 21 */
  input address1 $ 1-20 address2 $ 21-80;
  datalines;
1234 Washington St  DC 12345
1234 Irving St      Charlotte NC 12345
45 Wall street      New York NY 90454
;
run;

data test2;
  set test;
  address_compbl = compbl(address1)||compbl(address2);
run;

proc print data = test2;
run;

Obs  address1            address2
  1  1234 Washington St  DC 12345
  2  1234 Irving St      Charlotte NC 12345
  3  45 Wall street      New York NY 90454

Obs  address_compbl
  1  1234 Washington St DC 12345
  2  1234 Irving St Charlotte NC 12345
  3  45 Wall street New York NY 90454
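The effect of compbl is easiest to see on a string literal with visible runs of blanks. A minimal sketch, with made-up variable names and values:

```sas
data _null_;
  messy = "45   Wall    street";
  /* compbl turns each run of blanks into a single blank */
  tidy = compbl(messy);
  put tidy=;

  /* compbl leaves one leading blank in place; strip removes it */
  padded = "   New   York";
  tidy2 = strip(compbl(padded));
  put tidy2=;
run;
```

The log should show tidy=45 Wall street and tidy2=New York.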

SAS FAQ: How can I merge two files on address?

Given two data sets containing different pieces of information for an overlapping set of addresses, you may be interested in merging your datasets by address. Different data sources can represent the same address in very different ways, so this can be a daunting task. This page aims to offer some data management tips that will prevent errors and help in detecting true matches. We have included some code fragments and suggestions for SAS procedures that are especially useful when working with addresses.

Data Management Tips

1. Separate the addresses into components (street number, street name, city, state, zip). This can be made easier with the SAS functions scan and substr and the prx functions (prxparse, prxmatch, prxsubstr, etc.). You may later combine them again into one variable, but cleaning and formatting address components will be easier with separate variables.

In the dataset created below, the street number, street name, and street type all appear in the variable address1, and city, state, and zip code all appear in the variable address2. We use scan and substr here to separate address1 into three component variables. For details on how to use these functions, see SAS's online documentation.

data test;
  length address1 $40. address2 $60.;
  /* column input: address1 is in columns 1-20, address2 starts in column 21 */
  input address1 $ 1-20 address2 $ 21-80;
  datalines;
1234 Washington St  DC 12345
1234 Irving St      Charlotte NC 12345
45 Wall street      New York NY 90454
;
run;

proc print data = test;
run;

Obs  address1            address2
  1  1234 Washington St  DC 12345
  2  1234 Irving St      Charlotte NC 12345
  3  45 Wall street      New York NY 90454

data test;
  set test;
  /* last word is the street type, first word is the street number */
  streettype = scan(address1, -1);
  streetnum = scan(address1, 1);
  streettype_start = index(address1, trim(streettype));
  streetnum_end = length(streetnum);
  /* everything between the number and the type is the street name;
     strip removes the leading blank that follows the street number */
  streetname = strip(substr(address1, streetnum_end+1, (streettype_start-streetnum_end-1)));
  drop streettype_start streetnum_end;
run;

proc print data = test;
  var streetnum streetname streettype;
run;

Obs  streetnum  streetname  streettype
  1  1234       Washington  St
  2  1234       Irving      St
  3  45         Wall        street

If your data are less consistent, regular expressions may be the best approach. For help with regular expressions in SAS, see our related code fragments page.

2. Within your string variables, trim leading and trailing spaces and check that the variable length is not truncating any of the values. For help with this, see SAS FAQ: How can I get rid of extra spaces in a string variable?. If your strings have both capital and lowercase letters, use either the upcase or lowcase function to force your string variables to be one or the other. If you have a string variable that contains only digits (like the variable streetnum in the above example), consider converting it to a numeric variable to avoid worrying about spacing or formatting.

data test;
  set test;
  streetname_cap = upcase(streetname);
  streetnum_num = 1*streetnum;
run;
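Multiplying by 1 relies on automatic character-to-numeric conversion, which writes a conversion note to the log and reports an error for values that are not purely numeric. An explicit conversion with the input function is a common alternative. The following is a sketch with a made-up data set (nums) and made-up values; the 8. informat width is an assumption that street numbers have at most eight digits.

```sas
data nums;
  length streetnum $8;
  input streetnum $;
  /* ?? suppresses the invalid-data messages and returns missing
     for values that are not numeric, such as 12B */
  streetnum_num = input(streetnum, ?? 8.);
  datalines;
1234
45
12B
;
run;

proc print data = nums;
run;
```

Records whose street number comes back missing can then be inspected separately before the merge.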

3. Focus on the information in your data that is most meaningful in matching records. There is likely information within an address that is redundant and more difficult to match. For example, given a zip code, city and state are redundant pieces of information that may be difficult to match due to different spelling/spacing/capitalization patterns. In the example above, streettype looks like it does not add information, but might complicate a merge.

4. Test your code as you go. Start with a very small set of observations from each dataset that includes some observations that you believe should match and some that should not. Run your merge and see if it works the way you believe it should. Use in = options in your merge so that you can quickly see how many records are not being matched. In the code below, we create a variable that indicates the outcome of the merge for each observation.

data merged_data;
  merge address_A (in = A) address_B (in = B);
  by streetnum streetname zip;
  if A and not B then match = "A only";
  if B and not A then match = "B only";
  if A and B then match = "match";
run;

Common Issues to Look Out For

Numbered streets that can be presented as numbers (2nd St., 5th Ave.) or spelled out (Second St., Fifth Ave.).

Zip codes with leading zeroes (these occur in New England, for example) that may or may not appear.

Directions on streets (N. Taylor Rd., E. 154th St.).

Fractions in addresses (332 1/2 Ashton Ave.).

Different indications of apartments/condos/units/suites/floors in a given building.

You may face all or none of the issues listed above. Most of them can be addressed with the procedures presented in the code fragments or links shown above.
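For the leading-zero issue in particular, a numeric ZIP code can be turned into a five-character string with the z5. format, which pads with zeroes on the left. A minimal sketch with a made-up data set and made-up variable names:

```sas
data zips;
  input zip_num;
  /* z5. writes the number in 5 columns, padding with leading zeroes */
  zip_char = put(zip_num, z5.);
  datalines;
2139
90024
;
run;

proc print data = zips;
run;
```

Here 2139 becomes the string 02139, so both files can be brought to the same five-character form before merging.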

Final Thoughts

Keep in mind that matching on addresses is difficult under the best circumstances. Being realistic in your expectations and knowing the quality of your data can allow you to know when you have hit a reasonable match rate.

SAS FAQ: How can I see the number of missing values and patterns of missing values in a data file?

Sometimes a data set has "holes" in it, i.e., missing values, and we may want to know the number of missing values for each variable and how the missing values are distributed. We will use the following data set as our example.

data test;
  /* city and season are read with formatted input, so they must stay in
     fixed columns: city in columns 31-36, season in columns 37-44 */
  input landval improval totval salepric saltoapr city $6. season $8.;
  datalines;
30000 64831 94831 118500 1.25 A     spring
30000 50765 80765 93900     .       winter
46651 18573 65224 .      1.16 B
45990 91402 .     184000 1.34 C     winter
42394 .     40575 168000 1.43
.     3351  51102 169000 1.12 D     winter
63596 2182  65778 .      1.26 E     spring
56658 53806 10464 255000 1.21
51428 72451 .     .      1.18 F     spring
93200 .     4321  422000 1.04
76125 78172 54297 290000 1.14 G     winter
.     61934 16294 237000 1.10 H     spring
65376 34458 .     286500 1.43       winter
42400 .     57446 .         . K
40800 92606 33406 168000 1.26 S
;
run;

1. Number of missing values vs. number of non missing values in each variable

The first thing we are going to look at is which variables have a lot of missing values. For numeric variables, we use proc means with the options n and nmiss.

proc means data = test n nmiss;
  var _numeric_;
run;

Variable     N  N Miss
----------------------
LANDVAL     13       2
IMPROVAL    12       3
TOTVAL      12       3
SALEPRIC    11       4
SALTOAPR    13       2

For character variables, we can use proc freq to display the number of missing values in each variable.

proc freq data = test;
  tables city season;
run;

                              Cumulative  Cumulative
city    Frequency   Percent    Frequency     Percent
----------------------------------------------------
A               1     10.00            1       10.00
B               1     10.00            2       20.00
C               1     10.00            3       30.00
D               1     10.00            4       40.00
E               1     10.00            5       50.00
F               1     10.00            6       60.00
G               1     10.00            7       70.00
H               1     10.00            8       80.00
K               1     10.00            9       90.00
S               1     10.00           10      100.00

Frequency Missing = 5

                              Cumulative  Cumulative
season  Frequency   Percent    Frequency     Percent
----------------------------------------------------
spring          4     44.44            4       44.44
winter          5     55.56            9      100.00

Frequency Missing = 6
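When a character variable has many distinct values, the full frequency table is more than is needed just to count missing values. One common alternative is to collapse every variable to "Missing" versus "Not missing" with a user-defined format. The following is a minimal sketch; the format names missfmt and $missfmt and the small data set demo are made up for illustration.

```sas
proc format;
  value $missfmt ' ' = 'Missing' other = 'Not missing';
  value  missfmt  .  = 'Missing' other = 'Not missing';
run;

/* tiny made-up data set with one missing value of each type */
data demo;
  length city $6;
  landval = 30000; city = "A"; output;
  landval = .;     city = "B"; output;
  landval = 46651; city = " "; output;
run;

proc freq data = demo;
  /* the formats group the values; the MISSING option makes the
     missing level count as a regular table row */
  format _character_ $missfmt. _numeric_ missfmt.;
  tables _all_ / missing nocum nopercent;
run;
```

Each table then has at most two rows, one for missing and one for non-missing values, for numeric and character variables alike.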

2. Number of missing values in each observation

We can also look at the number of missing values in each observation. For example, we can use SAS function cmiss to store the number of missing values from both numeric and character variables in each observation.

data test1;
  set test;
  miss_n = cmiss(of landval -- season);
run;

proc print data = test1;
run;

Obs  landval  improval  totval  salepric  saltoapr  city  season  miss_n

  1    30000     64831   94831    118500      1.25  A     spring       0
  2    30000     50765   80765     93900       .          winter       2
  3    46651     18573   65224         .      1.16  B                  2
  4    45990     91402       .    184000      1.34  C     winter       1
  5    42394         .   40575    168000      1.43                     3
  6        .      3351   51102    169000      1.12  D     winter       1
  7    63596      2182   65778         .      1.26  E     spring       1
  8    56658     53806   10464    255000      1.21                     2
  9    51428     72451       .         .      1.18  F     spring       2
 10    93200         .    4321    422000      1.04                     3
 11    76125     78172   54297    290000      1.14  G     winter       0
 12        .     61934   16294    237000      1.10  H     spring       1
 13    65376     34458       .    286500      1.43        winter       2
 14    42400         .   57446         .       .    K                  4
 15    40800     92606   33406    168000      1.26  S                  1
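If you want to separate the numeric from the character counts, or flag complete cases, a variation along the following lines may help. This is a sketch that assumes the data set test from above; the names test2, num_miss, char_miss and complete are made up for the example.

```sas
data test2;
  set test;
  /* nmiss accepts numeric arguments only */
  num_miss  = nmiss(of landval -- saltoapr);
  /* cmiss accepts character (and numeric) arguments */
  char_miss = cmiss(city, season);
  /* 1 if the observation has no missing value at all, else 0 */
  complete  = (num_miss + char_miss = 0);
run;

proc freq data = test2;
  tables complete;
run;
```

Judging from the miss_n column in the listing above, two observations (1 and 11) should end up with complete = 1.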

3. Distribution of missing values

We can also look at the patterns of missing values. We can recode each variable into a dummy variable such that 1 is missing and 0 is nonmissing. Then we use proc freq with the tables statement and the list option to compute the frequency of each pattern of missing data.

data miss_pattern (drop=i);
  set test;
  * recode numeric variables: 1 = missing, 0 = nonmissing;
  array mynum(*) _numeric_;
  do i = 1 to dim(mynum);
    if mynum(i) = . then mynum(i) = 1;
    else mynum(i) = 0;
  end;
  * recode character variables the same way, using character constants;
  array mychar(*) $ _character_;
  do i = 1 to dim(mychar);
    if mychar(i) = "" then mychar(i) = "1";
    else mychar(i) = "0";
  end;
run;

proc freq data = miss_pattern;
  tables landval*improval*totval*salepric*saltoapr*city*season / list;
run;

                                                                               Cumulative  Cumulative
landval  improval  totval  salepric  saltoapr  city  season  Frequency  Percent   Frequency     Percent
-------------------------------------------------------------------------------------------------------
      0         0       0         0         0  0     0               2    13.33           2       13.33
      0         0       0         0         0  0     1               1     6.67           3       20.00
      0         0       0         0         0  1     1               1     6.67           4       26.67
      0         0       0         0         1  1     0               1     6.67           5       33.33
      0         0       0         1         0  0     0               1     6.67           6       40.00
      0         0       0         1         0  0     1               1     6.67           7       46.67
      0         0       1         0         0  0     0               1     6.67           8       53.33
      0         0       1         0         0  1     0               1     6.67           9       60.00
      0         0       1         1         0  0     0               1     6.67          10       66.67
      0         1       0         0         0  1     1               2    13.33          12       80.00
      0         1       0         1         1  0     1               1     6.67          13       86.67
      1         0       0         0         0  0     0               2    13.33          15      100.00

Now we see that there are two observations with no missing values, one observation with one missing value in variable season, and so on.
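The recoding above overwrites the original values. If you would rather keep them, one alternative is to store the pattern in a separate character variable. The sketch below assumes the data set test from above; miss_string and pattern are hypothetical names. It relies on the missing function, which returns 1 or 0 for both numeric and character arguments, and on cats, which concatenates its arguments after stripping blanks.

```sas
data miss_string;
  set test;
  length pattern $7;
  /* one digit per variable, in the same order as in the table above */
  pattern = cats(missing(landval), missing(improval), missing(totval),
                 missing(salepric), missing(saltoapr),
                 missing(city), missing(season));
run;

proc freq data = miss_string;
  tables pattern;
run;
```

A value such as 0000001 then corresponds to the row "0 0 0 0 0 0 1" in the table above, i.e., only season is missing.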

SAS FAQ How do I specify types of missing values?

When a data file has missing values, sometimes we may want to be able to distinguish between different types of missing values. For example, we can have missing values because of non-response or missing values because of invalid data entry. The examples here are related to this issue.

Example 1: Specifying types of missing values in a data set

In SAS, we can use letters A-Z and underscore "_" to indicate the type of missing values.

In the example below, the variable female has the value -999 when the subject refused to answer the question and the value -99 when there was a data entry error. The same is true of the variable ses. The first code fragment hard codes the changes; the second does the same operation using an array.

data test1;
  input score female ses;
datalines;
56 1 1
62 1 2
73 0 3
67 -999 1
57 0 1
56 -99 2
57 1 -999
;
run;

*hard code;
data test1a;
  set test1;
  if female = -999 then female = .a;
  if female = -99 then female = .b;
  if ses = -999 then ses = .a;
run;

proc print data = test1a;
run;

Obs    score    female    ses
 1       56       1        1
 2       62       1        2
 3       73       0        3
 4       67       A        1
 5       57       0        1
 6       56       B        2
 7       57       1        A

*using the array;
data test1b;
  set test1;
  array miss(2) female ses;
  do i = 1 to 2;
    if miss(i) = -999 then miss(i) = .a;
    if miss(i) = -99 then miss(i) = .b;
  end;
  drop i;
run;

proc print data = test1b;
run;

Obs    score    female    ses
 1       56       1        1
 2       62       1        2
 3       73       0        3
 4       67       A        1
 5       57       0        1
 6       56       B        2
 7       57       1        A

We should notice that when SAS prints a special missing value, it prints only the letter or underscore, not the dot ".".
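Once the special missing values are in place, you will usually want to tabulate them and test for them. The sketch below assumes the data set test1a created above; the names test1c, refused and any_miss are hypothetical.

```sas
/* With the missing option, proc freq lists .A and .B as   */
/* separate categories instead of lumping them together.   */
proc freq data = test1a;
  tables female ses / missing;
run;

data test1c;
  set test1a;
  /* 1 only when the value is the "refused" code .A */
  refused  = (female = .a);
  /* 1 for any kind of missing value: ., ._ or .A through .Z */
  any_miss = missing(female);
run;
```

SAS sorts numeric missing values as ._ < . < .A < ... < .Z, and all of them sort below any number, so a condition such as female > .z is another way to select only the nonmissing values.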

Example 2: Specifying types of missing values in a raw data file

We have a tiny example raw data file called tiny.txt with three variables, shown below. The variables are score, female and ses. These three variables are meant to be numeric, except that we have special characters for missing values. In this example, "a" means that the subject refused to give the information and "b" means a data entry error. Notice that the valid characters here are the 26 letters a-z and the underscore "_".

56 1 1
62 1 2
73 0 3
67 a 1
57 0 1
56 1 2
57 1 b

We want to read the variables as numeric and we also want to keep the information on the nature of the missing values. In SAS, we can read these variables as numeric from this file by using the missing statement in the data step. Here is how we can do it:

data test0;
  missing a b;
  * adjust the path to wherever tiny.txt is saved;
  infile 'd:\temp\tiny.txt';
  input score female ses;
run;

proc print data = test0;
run;

Obs    score    female    ses
 1       56       1        1
 2       62       1        2
 3       73       0        3
 4       67       A        1
 5       57       0        1
 6       56       1        2
 7       57       1        B

There are then two types of missing values in the data set test0: .A and .B. For example, when we want to refer to the 4th observation, where the value of the variable female is missing, we can use a where statement such as "where female=.a;" as shown in the following example:

proc print data = test0;
  where female = .a;
run;

Obs    score    female    ses
 4       67       A        1
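If you want to try the missing statement without creating an external file, the same seven records can be read as in-stream data. This is just a sketch; test0b is a hypothetical data set name.

```sas
data test0b;
  /* declare a and b as codes for special missing values */
  missing a b;
  input score female ses;
datalines;
56 1 1
62 1 2
73 0 3
67 a 1
57 0 1
56 1 2
57 1 b
;
run;

proc print data = test0b;
run;
```

The listing should match the one shown for test0 above, with A and B printed in place of the special missing values.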

SAS FAQ How can I create different kinds of centered variables in SAS?

Centering a variable means that a constant has been subtracted from every value of a variable. There are several ways that you can center variables. For example, you could center the variable around a constant that has intrinsic meaning for the variable, such as centering a continuous variable age around 18 to represent when Americans come of voting age. You could also center a variable around its mean, or you could use a categorical variable to group your continuous variable, and get means for each group. Each of these techniques is shown below.

We will use the test data set presented below for all of our examples. We understand that for most purposes such a data set is unrealistically small, but its size makes it easier to see what is happening in each step.

data test;
  input studentid class score1 score2;
cards;
1 1 34 24
2 1 39 25
3 1 34 26
4 1 38 20
5 1 32 21
1 2 45 36
2 2 43 30
3 2 48 39
4 2 41 37
5 2 40 31
1 3 50 46
2 3 51 49
3 3 57 48
4 3 50 40
5 3 57 46
;
run;

1. Centering a variable around a constant

Suppose that we wanted to center all of the values in the variable score1 around 45.

data center45;
  set test;
  c45 = score1 - 45;
run;

proc print data = center45;
run;

The listing below is shown sorted by studentid and class.

Obs    studentid    class    score1    score2    c45
  1        1          1        34        24      -11
  2        1          2        45        36        0
  3        1          3        50        46        5
  4        2          1        39        25       -6
  5        2          2        43        30       -2
  6        2          3        51        49        6
  7        3          1        34        26      -11
  8        3          2        48        39        3
  9        3          3        57        48       12
 10        4          1        38        20       -7
 11        4          2        41        37       -4
 12        4          3        50        40        5
 13        5          1        32        21      -13
 14        5          2        40        31       -5
 15        5          3        57        46       12

Now let's center the scores for each class around a different constant. Let's suppose that score1 for class 1 should be centered around 30, for class 2 the scores should be centered around 40, and for class 3 the scores should be centered around 50. The proc sort was added only to make the output easier to read; it is not necessary for the program to work.

data centerdiff;
  set test;
  if class = 1 then c1 = score1 - 30;
  if class = 2 then c1 = score1 - 40;
  if class = 3 then c1 = score1 - 50;
run;

proc sort data = centerdiff;
  by class studentid;
run;

proc print data = centerdiff;
run;

Obs    studentid    class    score1    score2    c1
  1        1          1        34        24       4
  2        2          1        39        25       9
  3        3          1        34        26       4
  4        4          1        38        20       8
  5        5          1        32        21       2
  6        1          2        45        36       5
  7        2          2        43        30       3
  8        3          2        48        39       8
  9        4          2        41        37       1
 10        5          2        40        31       0
 11        1          3        50        46       0
 12        2          3        51        49       1
 13        3          3        57        48       7
 14        4          3        50        40       0
 15        5          3        57        46       7
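With many classes, a long series of if statements becomes tedious. One possible alternative is to keep the constants in a temporary array indexed by class. This sketch assumes the data set test from above and that class only takes the values 1, 2 and 3 with no missing values (any other value would cause an array subscript error); centerdiff2 and center are hypothetical names.

```sas
data centerdiff2;
  set test;
  /* centering constants for class 1, 2 and 3, in that order; */
  /* _temporary_ keeps the array out of the output data set   */
  array center(3) _temporary_ (30 40 50);
  c1 = score1 - center(class);
run;
```

The resulting c1 should be the same as in the centerdiff data set above.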

2. Grand mean centering

Instead of centering a variable around a value that you select, you may want to center it around its mean. This is known as grand mean centering. There are at least three ways that you can do this. Perhaps the most straightforward way is to get the mean of each variable that you want to center and subtract that value from the variable in a data step. This is simple if you only need to center a few variables.

proc means data = test mean;
  var score1 score2;
run;

Variable            Mean
------------------------
score1        43.9333333
score2        34.5333333
------------------------

data grand;
  set test;
  grmscore1 = score1 - 43.93;
  grmscore2 = score2 - 34.53;
run;

proc print data = grand;
run;

Obs    studentid    class    score1    score2    grmscore1    grmscore2
  1        1          1        34        24        -9.93       -10.53
  2        2          1        39        25        -4.93        -9.53
  3        3          1        34        26        -9.93        -8.53
  4        4          1        38        20        -5.93       -14.53
  5        5          1        32        21       -11.93       -13.53
  6        1          2        45        36         1.07         1.47
  7        2          2        43        30        -0.93        -4.53
  8        3          2        48        39         4.07         4.47
  9        4          2        41        37        -2.93         2.47
 10        5          2        40        31        -3.93        -3.53
 11        1          3        50        46         6.07        11.47
 12        2          3        51        49         7.07        14.47
 13        3          3        57        48        13.07        13.47
 14        4          3        50        40         6.07         5.47
 15        5          3        57        46        13.07        11.47

A second way to create a grand mean centered variable is to use proc means, output the means to a data set, and then merge that data set with your original data set. This is illustrated below. The data set output by proc means is shown below. As you can see, it has only one observation. The other thing to notice about this data set is that it has no variables in common with the original data set. This makes merging it with the original data set somewhat more difficult. The steps needed to overcome this problem are explained just above the data step that performs the merge.

proc means data = test mean;
  var score1 score2;
  output out = grand1 mean = m1 m2;
run;

proc print data = grand1;
run;

Obs    _TYPE_    _FREQ_       m1         m2
 1        0        15      43.9333    34.5333

proc sort data = test;
  by studentid class;
run;

If you try to merge the grand1 data set and the original test data set as you normally would, you will find that you have the values of m1 and m2 only for the first case, and missing values for the remaining 14 cases. Hence, we use an if-then-do group on the first iteration of the data step to assign the values of m1 and m2 to new variables, which we have called mean1 and mean2. Also, we need to use the retain statement to retain the values of mean1 and mean2 so that their values are not set to missing when the data step iterates the second time. We cannot just retain m1 and m2, because they are read from grand1, and SAS sets them to missing once that data set runs out of observations. We use the drop statement to drop the variables m1 and m2, as well as the _type_ and _freq_ variables that were in the grand1 data set. Finally, we calculate the grand mean centered variables that we want, grmscore1 and grmscore2.

data grand1merged;
  merge test grand1;
  retain mean1 mean2;
  if _n_ = 1 then do;
    mean1 = m1;
    mean2 = m2;
  end;
  drop _freq_ _type_ m1 m2;
  grmscore1 = score1 - mean1;
  grmscore2 = score2 - mean2;
run;

proc print data = grand1merged;
run;

Obs  studentid  class  score1  score2    mean1    mean2  grmscore1  grmscore2
  1      1        1      34      24    43.9333  34.5333    -9.9333   -10.5333
  2      1        2      45      36    43.9333  34.5333     1.0667     1.4667
  3      1        3      50      46    43.9333  34.5333     6.0667    11.4667
  4      2        1      39      25    43.9333  34.5333    -4.9333    -9.5333
  5      2        2      43      30    43.9333  34.5333    -0.9333    -4.5333
  6      2        3      51      49    43.9333  34.5333     7.0667    14.4667
  7      3        1      34      26    43.9333  34.5333    -9.9333    -8.5333
  8      3        2      48      39    43.9333  34.5333     4.0667     4.4667
  9      3        3      57      48    43.9333  34.5333    13.0667    13.4667
 10      4        1      38      20    43.9333  34.5333    -5.9333   -14.5333
 11      4        2      41      37    43.9333  34.5333    -2.9333     2.4667
 12      4        3      50      40    43.9333  34.5333     6.0667     5.4667
 13      5        1      32      21    43.9333  34.5333   -11.9333   -13.5333
 14      5        2      40      31    43.9333  34.5333    -3.9333    -3.5333
 15      5        3      57      46    43.9333  34.5333    13.0667    11.4667
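A more compact variant of the same idea is to read the one-observation data set with its own set statement on the first iteration only. Variables read by a set statement are retained automatically, so no retain statement or extra variables are needed. The sketch below assumes the data sets test and grand1 from above; grand1set is a hypothetical name.

```sas
data grand1set;
  /* read the single row of means once; m1 and m2 then keep */
  /* their values for every later iteration of the step     */
  if _n_ = 1 then set grand1 (keep = m1 m2);
  set test;
  grmscore1 = score1 - m1;
  grmscore2 = score2 - m2;
run;
```

Here m1 and m2 stay in the output data set and play the role of mean1 and mean2 above.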

A third way to create grand mean centered variables is to use proc sql. In the code below, four new variables are created: mean1 is the mean of score1, mean2 is the mean of score2, grandmc1 is the grand mean centered variable for score1 and grandmc2 is the grand mean centered variable for score2.

* grand mean centering using proc sql;
proc sql;
  create table grndmc as
  select *,
         mean(score1) as mean1,
         mean(score2) as mean2,
         score1 - mean(score1) as grandmc1,
         score2 - mean(score2) as grandmc2
  from test;
quit;

proc print data = grndmc;
run;

Obs  studentid  class  score1  score2    mean1    mean2   grandmc1   grandmc2
  1      1        1      34      24    43.9333  34.5333    -9.9333   -10.5333
  2      1        2      45      36    43.9333  34.5333     1.0667     1.4667
  3      1        3      50      46    43.9333  34.5333     6.0667    11.4667
  4      2        1      39      25    43.9333  34.5333    -4.9333    -9.5333
  5      2        2      43      30    43.9333  34.5333    -0.9333    -4.5333
  6      2        3      51      49    43.9333  34.5333     7.0667    14.4667
  7      3        1      34      26    43.9333  34.5333    -9.9333    -8.5333
  8      3        2      48      39    43.9333  34.5333     4.0667     4.4667
  9      3        3      57      48    43.9333  34.5333    13.0667    13.4667
 10      4        1      38      20    43.9333  34.5333    -5.9333   -14.5333
 11      4        2      41      37    43.9333  34.5333    -2.9333     2.4667
 12      4        3      50      40    43.9333  34.5333     6.0667     5.4667
 13      5        1      32      21    43.9333  34.5333   -11.9333   -13.5333
 14      5        2      40      31    43.9333  34.5333    -3.9333    -3.5333
 15      5        3      57      46    43.9333  34.5333    13.0667    11.4667

3. Creating an aggregate variable

There may be times when you want to create an aggregate variable. An aggregate variable is one that aggregates data from a "lower level" to a "higher level". In this example, the students' test scores (which can be thought of as a level 1 variable) are aggregated to the classroom level (which can be thought of as a level 2 variable). Hence, a new variable is created that is the mean of the test scores for each class.

In the code below, the output statement is used to output the means for each variable (in this case, score1 and score2) to a new data set called aggtest. The means for score1 are put into a variable called m1 and the means for score2 are put into a variable called m2.

* the by statement requires the data to be sorted by class;
proc sort data = test;
  by class;
run;

proc means data = test mean;
  var score1 score2;
  by class;
  output out = aggtest mean = m1 m2;
run;

proc print data = aggtest;
run;

Obs    class    _TYPE_    _FREQ_     m1      m2
 1       1         0         5      35.4    23.2
 2       2         0         5      43.4    34.6
 3       3         0         5      53.0    45.8

data merged;
  merge test aggtest;
  by class;
  drop _TYPE_ _FREQ_;
run;

proc print data = merged;
run;

Obs    studentid    class    score1    score2     m1      m2
  1        1          1        34        24      35.4    23.2
  2        2          1        39        25      35.4    23.2
  3        3          1        34        26      35.4    23.2
  4        4          1        38        20      35.4    23.2
  5        5          1        32        21      35.4    23.2
  6        1          2        45        36      43.4    34.6
  7        2          2        43        30      43.4    34.6
  8        3          2        48        39      43.4    34.6
  9        4          2        41        37      43.4    34.6
 10        5          2        40        31      43.4    34.6
 11        1          3        50        46      53.0    45.8
 12        2          3        51        49      53.0    45.8
 13        3          3        57        48      53.0    45.8
 14        4          3        50        40      53.0    45.8
 15        5          3        57        46      53.0    45.8

You can do the same thing using proc sql. In the code below, a data set called aggtestsql is created. In the select clause, you can see that the mean of score1 is computed and stored in a variable called mean1, and the mean of score2 is computed and stored in a variable called mean2. The group by clause is needed so that the means are computed by group, in this case by the variable class. If this clause were omitted, the means created would be grand means (in other words, means for the whole variable, not broken out by class).

proc sql;
  create table aggtestsql as
  select *,
         mean(score1) as mean1,
         mean(score2) as mean2
  from test
  group by class;
quit;

proc print data = aggtestsql;
run;

Obs    studentid    class    score1    score2    mean1    mean2
  1        1          1        34        24       35.4     23.2
  2        2          1        39        25       35.4     23.2
  3        3          1        34        26       35.4     23.2
  4        4          1        38        20       35.4     23.2
  5        5          1        32        21       35.4     23.2
  6        1          2        45        36       43.4     34.6
  7        2          2        43        30       43.4     34.6
  8        3          2        48        39       43.4     34.6
  9        4          2        41        37       43.4     34.6
 10        5          2        40        31       43.4     34.6
 11        1          3        50        46       53.0     45.8
 12        2          3        51        49       53.0     45.8
 13        3          3        57        48       53.0     45.8
 14        4          3        50        40       53.0     45.8
 15        5          3        57        46       53.0     45.8
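The query above keeps one row per student because it selects all of the original columns, so proc sql merges the group means back onto the individual rows. If you only want the class-level data set (one row per class, like aggtest), a sketch along these lines should work; classmeans is a hypothetical table name.

```sas
proc sql;
  /* select only the grouping variable and the summaries, */
  /* which yields one row per class                       */
  create table classmeans as
  select class,
         mean(score1) as mean1,
         mean(score2) as mean2
  from test
  group by class;
quit;

proc print data = classmeans;
run;
```

With the test data this should give three rows, with the same means as m1 and m2 in aggtest.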

4. Group mean centering

Just as there are at least three ways to create a grand mean centered variable, there are at least three different ways to create a group mean centered variable. The first way illustrated below is very straightforward, but it may be impractical if you have lots of groups (or classes). To save space, we have group mean centered only one variable, score1.

proc means data = test mean;
  by class;
  var score1;
run;

class=1

The MEANS Procedure

Analysis Variable : score1

        Mean
------------
  35.4000000
------------

class=2

Analysis Variable : score1

        Mean
------------
  43.4000000
------------

class=3

Analysis Variable : score1

        Mean
------------
  53.0000000
------------

data group;
  set test;
  if class = 1 then grpmscore1 = score1 - 35.4;
  if class = 2 then grpmscore1 = score1 - 43.4;
  if class = 3 then grpmscore1 = score1 - 53.0;
run;

proc print data = group;
run;

The listing below is shown sorted by studentid and class.

Obs    studentid    class    score1    score2    grpmscore1
  1        1          1        34        24         -1.4
  2        1          2        45        36          1.6
  3        1          3        50        46         -3.0
  4        2          1        39        25          3.6
  5        2          2        43        30         -0.4
  6        2          3        51        49         -2.0
  7        3          1        34        26         -1.4
  8        3          2        48        39          4.6
  9        3          3        57        48          4.0
 10        4          1        38        20          2.6
 11        4          2        41        37         -2.4
 12        4          3        50        40         -3.0
 13        5          1        32        21         -3.4
 14        5          2        40        31         -3.4
 15        5          3        57        46          4.0

A second way to create a group mean centered variable is to use proc means, output the means to a data set, and then merge that data set with your original data set. This is shown below.

proc means data = test mean;
  var score1 score2;
  by class;
  output out = grpmeanctr mean = m1 m2;
run;

proc sort data = test;
  by class studentid;
run;

data merged2;
  merge test grpmeanctr;
  by class;
  drop _TYPE_ _FREQ_;
  groupmc1 = score1 - m1;
  groupmc2 = score2 - m2;
run;

proc print data = merged2;
run;

Obs  studentid  class  score1  score2    m1     m2   groupmc1  groupmc2
  1      1        1      34      24    35.4   23.2     -1.4       0.8
  2      2        1      39      25    35.4   23.2      3.6       1.8
  3      3        1      34      26    35.4   23.2     -1.4       2.8
  4      4        1      38      20    35.4   23.2      2.6      -3.2
  5      5        1      32      21    35.4   23.2     -3.4      -2.2
  6      1        2      45      36    43.4   34.6      1.6       1.4
  7      2        2      43      30    43.4   34.6     -0.4      -4.6
  8      3        2      48      39    43.4   34.6      4.6       4.4
  9      4        2      41      37    43.4   34.6     -2.4       2.4
 10      5        2      40      31    43.4   34.6     -3.4      -3.6
 11      1        3      50      46    53.0   45.8     -3.0       0.2
 12      2        3      51      49    53.0   45.8     -2.0       3.2
 13      3        3      57      48    53.0   45.8      4.0       2.2
 14      4        3      50      40    53.0   45.8     -3.0      -5.8
 15      5        3      57      46    53.0   45.8      4.0       0.2

A third way to accomplish the same thing is to use proc sql. As before, four new variables are being created. You do not have to create the mean1 and mean2 variables; we have included them only for the sake of completeness and to show how this would be done.

proc sql;
  create table grpmeanctrsql as
  select *,
         mean(score1) as mean1,
         mean(score2) as mean2,
         score1 - mean(score1) as groupmc1,
         score2 - mean(score2) as groupmc2
  from test
  group by class;
quit;

proc print data = grpmeanctrsql;
run;

Obs  studentid  class  score1  score2  mean1  mean2  groupmc1  groupmc2
  1      1        1      34      24    35.4   23.2     -1.4       0.8
  2      2        1      39      25    35.4   23.2      3.6       1.8
  3      3        1      34      26    35.4   23.2     -1.4       2.8
  4      4        1      38      20    35.4   23.2      2.6      -3.2
  5      5        1      32      21    35.4   23.2     -3.4      -2.2
  6      1        2      45      36    43.4   34.6      1.6       1.4
  7      2        2      43      30    43.4   34.6     -0.4      -4.6
  8      3        2      48      39    43.4   34.6      4.6       4.4
  9      4        2      41      37    43.4   34.6     -2.4       2.4
 10      5        2      40      31    43.4   34.6     -3.4      -3.6
 11      1        3      50      46    53.0   45.8     -3.0       0.2
 12      2        3      51      49    53.0   45.8     -2.0       3.2
 13      3        3      57      48    53.0   45.8      4.0       2.2
 14      4        3      50      40    53.0   45.8     -3.0      -5.8
 15      5        3      57      46    53.0   45.8      4.0       0.2
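Beyond the three approaches shown here, proc standard with the mean=0 option can also center variables, and a by statement makes it center within groups. Because proc standard replaces the values of the variables it processes, the sketch below first copies the scores into new variables. It assumes the data set test from above; test_copy and grpstd are hypothetical names.

```sas
/* copy the scores so that the originals are kept */
data test_copy;
  set test;
  groupmc1 = score1;
  groupmc2 = score2;
run;

proc sort data = test_copy;
  by class;
run;

/* mean=0 shifts each variable to have mean 0 within each class; */
/* since std= is not specified, the values are not rescaled      */
proc standard data = test_copy mean = 0 out = grpstd;
  by class;
  var groupmc1 groupmc2;
run;

proc print data = grpstd;
run;
```

Leaving out the by statement would center the copies around the grand mean instead.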

SAS FAQ How can I create an enumeration variable by groups?

There are occasions, especially with survey data, when you need to create an enumeration (also called a counting or identification) variable that starts at one for each group in your data. For example, suppose that you have test scores for students in a class. You may need to create a variable that counts all of the males in the class, and then starts at one and counts all of the females in the class. Let's look at a small data set and see how this can be easily done.

data students;
  input gender score;
cards;
1 48
1 45
2 50
2 42
1 41
2 51
1 52
1 43
2 52
;
run;

First, we need to sort the data on the grouping variable, in this case, gender.

proc sort data = students;
  by gender;
run;

Next, we will create a new variable called count that will count the number of males and the number of females.

data students1;
  set students;
  count + 1;
  by gender;
  if first.gender then count = 1;
run;

Let's consider some of the code above and explain what it does and why. The third statement, count + 1, is a sum statement: it creates the variable count and adds one to it for each observation as SAS processes the data step. There is an implicit retain in this statement, which is why SAS does not reset the value of count to missing before processing the next observation in the data set. The next statement, by gender, tells SAS the grouping variable. In this example, the grouping variable is gender. The data set must be sorted by this variable before running this data step.

The if statement tells SAS when to reset the counter and to what value. SAS has two built-in keywords that are useful in situations like these: first. and last. (pronounced "first-dot" and "last-dot"). Note that the period is part of the keyword. The variable listed after the first. keyword is the grouping variable, and first.gender is true for the first observation in each group. If we wanted SAS to do something when it came to the last observation in the group, we would use the last. keyword. The last part of the statement is straightforward: after the keyword then, we list the name of the variable that we want and set it equal to the value that we want to be assigned to the first observation in the group. In this example, we wanted to start counting at one, but you could put any number there that meets your needs. Now let's see what our new data set looks like.

proc print data = students1;
run;

Obs    gender    score    count
 1        1        48       1
 2        1        45       2
 3        1        41       3
 4        1        52       4
 5        1        43       5
 6        2        50       1
 7        2        42       2
 8        2        51       3
 9        2        52       4

As you can see, the process worked as we desired.
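The last. keyword mentioned above can be used in the same way. As a sketch, the step below keeps one observation per group holding the group size; it assumes the students data set is still sorted by gender, and gender_n and n are hypothetical names.

```sas
data gender_n;
  set students;
  by gender;
  /* restart the counter at the beginning of each group */
  if first.gender then n = 0;
  n + 1;
  /* write a record only when the end of the group is reached */
  if last.gender then output;
  keep gender n;
run;

proc print data = gender_n;
run;
```

Given the listing above, this should produce two observations: n = 5 for gender 1 and n = 4 for gender 2.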

Now let's look at a slightly more complicated example. Suppose that we had two grouping variables, class and gender.

data two;
  input class gender score;
cards;
1 1 48
1 1 45
2 2 50
1 2 42
2 1 41
2 2 51
2 1 52
1 1 43
1 2 52
;
run;

proc sort data = two;
  by class gender;
run;

data two1;
  set two;
  count + 1;
  by class gender;
  if first.class or first.gender then count = 1;
run;

proc print data = two1;
run;

Obs    class    gender    score    count
 1       1         1        48       1
 2       1         1        45       2
 3       1         1        43       3
 4       1         2        42       1
 5       1         2        52       2
 6       2         1        41       1
 7       2         1        52       2
 8       2         2        50       1
 9       2         2        51       2

As you can see, expanding the code to handle multiple layers is simple. Also, although we have only two levels in our grouping variables, the number of levels within any of the grouping variables does not matter.

SAS FAQ How can I create tables using proc tabulate?

Proc tabulate is predominately used to make nice looking tables of summary statistics. This procedure is often used to create tables to be used in publications because it allows for a great deal of manipulation and control over almost every aspect of the table. It is a very versatile procedure that allows you to make simple two by two summary tables or customized multipage reports and everything in between. A few features include customizing the number, type and position of summary statistics. You can also customize the appearance of tables by specifying text position, size, font and color.

This page will show you how to use proc tabulate to make a wide variety of tables. We will use the hsb2 dataset which contains demographic and academic data about 200 high school students.

Table using proc means

Let's say that we need to present a table of mean writing scores by socioeconomic status and gender. Below we use proc means to create such a table.

proc means data=hsb2 mean;
  class female ses;
  var write;
run;

                           N
      female     ses     Obs            Mean
---------------------------------------------------
           0       1      15      46.6000000
                   2      47      49.5531915
                   3      29      52.8620690
           1       1      32      52.5000000
                   2      48      54.2500000
                   3      29      58.9655172
---------------------------------------------------

Table of means using proc tabulate

The table above displays the information that we need, and we could even make it easier to read by using value labels. However, the only control over the layout of the table is the ordering of the classification variables. For example, we could change the position of female and ses in the output by switching their order in the class statement. On the other hand, we could use proc tabulate, which gives us the same information with the added benefit of being able to arrange the table in the most appropriate way. Now let's try using proc tabulate to create a table with the same information as the one above. The following table is a basic example that specifies one row variable (female), one column variable (ses), one analysis variable (write) and one summary statistic (mean). Note that if we don't specify the mean as the summary statistic, our table will contain the sum, because it is the default summary statistic.

proc tabulate data = hsb2;
  class ses female;
  var write;
  table female*mean, write*ses;
run;

------------------------------------------------------------------------
|                               |                write                 |
|                               |--------------------------------------|
|                               |                 ses                  |
|                               |--------------------------------------|
|                               |     1      |     2      |     3      |
|-------------------------------+------------+------------+------------|
|female         |               |            |            |            |
|---------------+---------------|            |            |            |
|0              |Mean           |       46.60|       49.55|       52.86|
|---------------+---------------+------------+------------+------------|
|1              |Mean           |       52.50|       54.25|       58.97|
------------------------------------------------------------------------

The table above presents the information in a layout that is easier to read than the proc means output. Let's say that we would prefer to move gender and the mean to the columns and leave only socioeconomic status in the rows. As you can see, this layout is quite different from the table above, yet it contains the same information.

proc tabulate data = hsb2;
  class ses female;
  var write;
  table ses, mean*write*female;
run;

-----------------------------------------------------------
|                               |          Mean           |
|                               |-------------------------|
|                               |          write          |
|                               |-------------------------|
|                               |         female          |
|                               |-------------------------|
|                               |     0      |     1      |
|-------------------------------+------------+------------|
|ses                            |            |            |
|-------------------------------|            |            |
|1                              |       46.60|       52.50|
|-------------------------------+------------+------------|
|2                              |       49.55|       54.25|
|-------------------------------+------------+------------|
|3                              |       52.86|       58.97|
-----------------------------------------------------------

From the two examples above, we see how the table statement works. It has four basic parts:

layout for the rows, defined in the class statement (ses variable, in the above example)

layout for the columns, defined in the class statement (female variable, in the above example)

the type of summary statistic (mean)

the name of the variable that the summary statistic is taken from, defined via the var statement (write)

Very important programming note: The ordering of the variables and statistics in the table statement can drastically change the layout and appearance of your table. In the table statement, the comma separates the rows from the columns: anything before the comma will be part of the rows and anything after the comma will be part of the columns.
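To see how the pieces of the table statement combine, here is one more hedged sketch that is not part of the original examples. It assumes the hsb2 data set is available in the work library, and it uses two standard proc tabulate features: the all keyword, which adds a summary row or column, and the f= format modifier, which sets the display format of the cells. The labels 'All students' and 'Both' are arbitrary.

```sas
proc tabulate data = hsb2;
  class ses female;
  var write;
  /* rows: each ses level plus a summary row;                 */
  /* columns: mean of write by gender plus a combined column, */
  /* displayed with two decimal places                        */
  table ses all='All students',
        write*mean*(female all='Both')*f=8.2;
run;
```

Moving ses all to the right of the comma and female all to the left would swap the rows and the columns.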

Refining tables with labels and formats

Now that we have seen the basics, let's use a few options to refine the presentation of the table. We will start by making the headings more compact, with titles that are easier to read. The example below shows how to change heading labels to make the table more reader friendly. Instead of letting SAS use the variable name as the default label, we can add more informative headings by setting each variable equal to a new title. We have changed ses to 'Socioeconomic Status' and write to 'Mean writing score'. Also notice that we can delete the heading 'Mean' by using a blank label.

proc tabulate data = hsb2;
  class ses female;
  var write;
  table ses='Socioeconomic Status',
        mean=' '*write='Mean writing score'*female;
run;

-----------------------------------------------------------
|                               |   Mean writing score    |
|                               |-------------------------|
|                               |         female          |
|                               |-------------------------|
|                               |     0      |     1      |
|-------------------------------+------------+------------|
|Socioeconomic Status           |            |            |
|-------------------------------|            |            |
|1                              |       46.60|       52.50|
|-------------------------------+------------+------------|
|2                              |       49.55|       54.25|
|-------------------------------+------------+------------|
|3                              |       52.86|       58.97|
-----------------------------------------------------------

The table above looks good but could still use some improvement. It is difficult to understand what we are looking at because we don't know what ses = 1 means. It would be much more convenient to use formats to change the numeric values to value labels.

proc format;
  value fm 1='Female' 0='Male';
  value ses 1='Low' 2='Middle' 3='High';
run;

proc tabulate data=hsb2;
  class ses female;
  var write;
  table ses='Socioeconomic Status',
        mean=' '*write='Mean writing score'*female='Gender';
  format female fm. ses ses.;
run;

-----------------------------------------------------------
|                               |   Mean writing score    |
|                               |-------------------------|
|                               |         Gender          |
|                               |-------------------------|
|                               |    Male    |   Female   |
|-------------------------------+------------+------------|
|Socioeconomic Status           |            |            |
|-------------------------------|            |            |
|Low                            |       46.60|       52.50|
|-------------------------------+------------+------------|
|Middle                         |       49.55|       54.25|
|-------------------------------+------------+------------|
|High                           |       52.86|       58.97|
-----------------------------------------------------------

The table above looks great and is very easy to read. We can easily see that the average writing score for males from a low socioeconomic status is 46.60. Additionally, we can make the same table in a more compact fashion by removing all of the headings and adding one table heading in the upper left hand corner. Notice that this example also uses the rtspace= option. This option allows you to specify the number of spaces for row titles to ensure that the whole title is printed in the table.

proc tabulate data = hsb2;
  class ses female;
  var write;
  table ses='', mean=''*write=''*female=''
        / box=[label="Mean of writing Score by ses and female"
               style=[font_style=italic]]
          rtspace=42;
  format female fm. ses ses.;
run;

--------------------------------------------------------------------
|Mean of writing Score by ses and female |    Male    |   Female   |
|----------------------------------------+------------+------------|
|Low                                     |       46.60|       52.50|
|----------------------------------------+------------+------------|
|Middle                                  |       49.55|       54.25|
|----------------------------------------+------------+------------|
|High                                    |       52.86|       58.97|
--------------------------------------------------------------------

Tables with multiple analysis variables and multiple statistics

Now that we have seen the basics of formatting, let's try making tables that display more information. Let's say that we want to present a table with both the mean math and the mean writing scores. The following is one of many ways to present this information. We decided on this particular layout because it is most important to compare males and females within each socioeconomic group by subject.

proc tabulate data = hsb2;
  class ses female;
  var write math;
  table ses*mean*(write math), female;
  format female fm. ses ses.;
run;

-----------------------------------------------------------
|                               |         female          |
|                               |-------------------------|
|                               |    Male    |   Female   |
|-------------------------------+------------+------------|
|ses      |         |           |            |            |
|---------+---------+-----------|            |            |
|Low      |Mean     |write      |       46.60|       52.50|
|         |         |-----------+------------+------------|
|         |         |math       |       47.60|       49.91|
|---------+---------+-----------+------------+------------|
|Middle   |Mean     |write      |       49.55|       54.25|
|         |         |-----------+------------+------------|
|         |         |math       |       53.47|       50.98|
|---------+---------+-----------+------------+------------|
|High     |Mean     |write      |       52.86|       58.97|
|         |         |-----------+------------+------------|
|         |         |math       |       54.86|       57.48|
-----------------------------------------------------------

Let's say that we would like to present both the mean and standard deviation of write in a single table.

proc tabulate data = hsb2;
  class ses female;
  var write math;
  table ses*(mean std), write*female
        / box=[label="mean and standard deviation by ses and female"];
  format female fm. ses ses.;
run;

-----------------------------------------------------------
|mean and standard deviation by |          write          |
|ses and female                 |-------------------------|
|                               |         female          |
|                               |-------------------------|
|                               |    Male    |   Female   |
|-------------------------------+------------+------------|
|ses            |               |            |            |
|---------------+---------------|            |            |
|Low            |Mean           |       46.60|       52.50|
|               |---------------+------------+------------|
|               |Std            |        9.03|        9.24|
|---------------+---------------+------------+------------|
|Middle         |Mean           |       49.55|       54.25|
|               |---------------+------------+------------|
|               |Std            |       10.16|        7.33|
|---------------+---------------+------------+------------|
|High           |Mean           |       52.86|       58.97|
|               |---------------+------------+------------|
|               |Std            |       10.78|        6.79|
-----------------------------------------------------------

Now let's combine the two tables above and present the mean and standard deviation of both math and write.

proc tabulate data = hsb2;
  class ses female;
  var write math;
  table ses*(mean std), (write math)*female
        / box=[label="mean and standard deviation by ses and female"];
  format female fm. ses ses.;
run;

-------------------------------------------------------------------------------------
|mean and standard deviation by |          write          |          math           |
|ses and female                 |-------------------------+-------------------------|
|                               |         female          |         female          |
|                               |-------------------------+-------------------------|
|                               |    Male    |   Female   |    Male    |   Female   |
|-------------------------------+------------+------------+------------+------------|
|ses            |               |            |            |            |            |
|---------------+---------------|            |            |            |            |
|Low            |Mean           |       46.60|       52.50|       47.60|       49.91|
|               |---------------+------------+------------+------------+------------|
|               |Std            |        9.03|        9.24|        6.78|        9.72|
|---------------+---------------+------------+------------+------------+------------|
|Middle         |Mean           |       49.55|       54.25|       53.47|       50.98|
|               |---------------+------------+------------+------------+------------|
|               |Std            |       10.16|        7.33|       10.57|        7.92|
|---------------+---------------+------------+------------+------------+------------|
|High           |Mean           |       52.86|       58.97|       54.86|       57.48|
|               |---------------+------------+------------+------------+------------|
|               |Std            |       10.78|        6.79|        8.62|        8.72|
-------------------------------------------------------------------------------------

Multi-way tables with stacked and nested variables

Proc tabulate is capable of producing sophisticated multi-level tables. Let's say that we would like to present the mean writing and math scores by socioeconomic status and also by program type. The following table shows how to stack the variables ses and prog. Notice in the following program that ses and prog are each crossed with write and math separately, and that only a blank separates the ses expression from the prog expression. This ensures that the variables will appear consecutively, or stacked, in the table.

proc format;
  value fm   1='Female' 0='Male';
  value ses  1='Low' 2='Middle' 3='High';
  value prog 1='General' 2='Academic' 3='Vocational';
run;

proc tabulate data = tab.hsb2;
  class ses female prog;
  var write math;
  table ses='Socioeconomic Status'*(write='Write' math='Math')
        prog='Program Type'*(write='Write' math='Math'),
        mean*female='';
  format female fm. ses ses. prog prog.;
run;

-----------------------------------------------------------
|                               |          Mean           |
|                               |-------------------------|
|                               |    Male    |   Female   |
|-------------------------------+------------+------------|
|Socioeconomic  |               |            |            |
|Status         |               |            |            |
|---------------+---------------|            |            |
|Low            |Write          |       46.60|       52.50|
|               |---------------+------------+------------|
|               |Math           |       47.60|       49.91|
|---------------+---------------+------------+------------|
|Middle         |Write          |       49.55|       54.25|
|               |---------------+------------+------------|
|               |Math           |       53.47|       50.98|
|---------------+---------------+------------+------------|
|High           |Write          |       52.86|       58.97|
|               |---------------+------------+------------|
|               |Math           |       54.86|       57.48|
|---------------+---------------+------------+------------|
|Program Type   |               |            |            |
|---------------+---------------|            |            |
|General        |Write          |       49.14|       53.25|
|               |---------------+------------+------------|
|               |Math           |       50.19|       49.88|
|---------------+---------------+------------+------------|
|Academic       |Write          |       54.62|       57.59|
|               |---------------+------------+------------|
|               |Math           |       57.13|       56.41|
|---------------+---------------+------------+------------|
|Vocational     |Write          |       41.83|       50.96|
|               |---------------+------------+------------|
|               |Math           |       46.91|       46.00|
-----------------------------------------------------------

We still want to present the mean writing and math scores by socioeconomic status and program type. The following table shows program type nested in socioeconomic status. Notice that in the program below ses and prog are connected by an *. The connection ensures that these two variables will appear nested in the table and that the cells of the table will represent more specific groups. The flexibility of proc tabulate allows complex nesting structures and a wide variety of layouts.

proc tabulate data = tab.hsb2;
  class ses female prog;
  var write math;
  table ses='Socioeconomic Status'*prog='Program Type'*(mean std),
        (write='Write' math='Math')*female='' / rts=43;
  format female fm. ses ses. prog prog.;
run;

-----------------------------------------------------------------------------------------------
|                                         |          Write          |          Math           |
|                                         |-------------------------+-------------------------|
|                                         |    Male    |   Female   |    Male    |   Female   |
|-----------------------------------------+------------+------------+------------+------------|
|Socioeconomic|Program Type |             |            |            |            |            |
|Status       |             |             |            |            |            |            |
|-------------+-------------+-------------|            |            |            |            |
|Low          |General      |Mean         |       48.29|       53.89|       46.71|       48.22|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |        6.05|        6.92|        8.12|        7.55|
|             |-------------+-------------+------------+------------+------------+------------|
|             |Academic     |Mean         |       51.25|       54.27|       50.00|       55.00|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |       11.79|        9.50|        6.68|       10.28|
|             |-------------+-------------+------------+------------+------------+------------|
|             |Vocational   |Mean         |       39.00|       47.63|       46.75|       42.25|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |        7.48|       10.32|        5.25|        3.92|
|-------------+-------------+-------------+------------+------------+------------+------------|
|Middle       |General      |Mean         |       47.20|       52.00|       51.10|       51.10|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |       10.67|        8.25|        9.27|        6.03|
|             |-------------+-------------+------------+------------+------------+------------|
|             |Academic     |Mean         |       55.59|       57.14|       58.14|       54.50|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |        6.65|        6.44|       10.09|        7.26|
|             |-------------+-------------+------------+------------+------------+------------|
|             |Vocational   |Mean         |       42.27|       51.69|       48.20|       46.06|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |        9.02|        6.85|        9.54|        7.54|
|-------------+-------------+-------------+------------+------------+------------+------------|
|High         |General      |Mean         |       55.50|       54.60|       54.00|       50.40|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |       15.26|       11.46|        4.32|        7.70|
|             |-------------+-------------+------------+------------+------------+------------|
|             |Academic     |Mean         |       54.24|       60.43|       57.43|       59.43|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |       10.08|        4.55|        7.74|        8.15|
|             |-------------+-------------+------------+------------+------------+------------|
|             |Vocational   |Mean         |       43.00|       56.00|       42.25|       55.67|
|             |             |-------------+------------+------------+------------+------------|
|             |             |Std          |        4.55|        9.64|        3.95|       10.50|
-----------------------------------------------------------------------------------------------
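
To make the difference between stacking and nesting easier to see, here is a minimal sketch that requests both layouts for a single statistic in one step. It assumes that hsb2 is available in the WORK library (the examples above read it from the tab library) and that the ses. and prog. formats have already been created with proc format.

```sas
/* Sketch only: assumes hsb2 is in WORK and that the formats
   ses. and prog. have been defined with proc format. */
proc tabulate data = hsb2;
  class ses prog;
  var write;
  /* Stacked: a blank between ses and prog concatenates them,
     so the ses rows are followed by the prog rows. */
  table ses prog, write*mean;
  /* Nested: an asterisk crosses them, giving one row for
     each combination of ses and prog. */
  table ses*prog, write*mean;
  format ses ses. prog prog.;
run;
```

The first table statement should produce six rows (three for ses, then three for prog), while the second produces one row for each ses by prog combination present in the data.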

SAS FAQ How can I change the way variables are displayed in proc freq?

The following program builds a data set called auto, which we will use for our examples.

DATA auto ;
  LENGTH make $ 20 ;
  INPUT make $ 1-17 price mpg rep78 modtype ;
CARDS;
AMC Concord        4099 22 3 2
AMC Pacer          4749 17 3 2
Audi 5000          9690 17 5 3
Audi Fox           6295 23 3 1
BMW 320i           9735 25 4 1
Buick Century      4816 20 3 2
Buick Electra      7827 15 4 2
Buick LeSabre      5788 18 3 2
Cad. Eldorado     14500 14 2 3
Olds Starfire      4195 24 1 2
Olds Toronado     10371 16 3 2
Plym. Volare       4060 18 2 2
Pont. Catalina     5798 18 4 2
Pont. Firebird     4934 18 1 1
Pont. Grand Prix   5222 19 3 2
Pont. Le Mans      4723 19 3 1
;
RUN;

Proc freq prints frequencies in ascending order, as determined by the variable's value. For example, we request frequencies for modtype below, and the table shows modtype 1 followed by modtype 2 followed by modtype 3.

PROC FREQ DATA=auto;
  TABLES modtype;
RUN;

                                Cumulative Cumulative
MODTYPE    Frequency   Percent   Frequency    Percent
-----------------------------------------------------
      1            4      25.0           4       25.0
      2           10      62.5          14       87.5
      3            2      12.5          16      100.0

Suppose we create value format labels to add descriptive text for the three models of cars using proc format, as illustrated below.

PROC FORMAT;
  VALUE typefmt 1 = 'Sporty '
                2 = 'Midsize'
                3 = 'Luxury' ;
RUN;

Our tables will print as follows:

PROC FREQ DATA = auto;
  TABLES modtype;
  FORMAT modtype typefmt.;
RUN;

                                Cumulative Cumulative
MODTYPE    Frequency   Percent   Frequency    Percent
-----------------------------------------------------
Sporty             4      25.0           4       25.0
Midsize           10      62.5          14       87.5
Luxury             2      12.5          16      100.0

The table still prints in an order determined by the actual variable value. Sporty cars are displayed first because that category is coded 1. We can, however, change the order in which the frequencies are displayed by using the order = formatted option. This option prints frequencies in alphabetical order as determined by the formatted value, as illustrated below.

PROC FREQ DATA=auto ORDER=FORMATTED;
  TABLES modtype;
  FORMAT modtype typefmt.;
RUN;

                                Cumulative Cumulative
MODTYPE    Frequency   Percent   Frequency    Percent
-----------------------------------------------------
Luxury             2      12.5           2       12.5
Midsize           10      62.5          12       75.0
Sporty             4      25.0          16      100.0

You can use the order = freq option if you want to print values in descending order of frequency.

PROC FREQ DATA= auto ORDER=FREQ;
  TABLES modtype;
  FORMAT modtype typefmt.;
RUN;

                                Cumulative Cumulative
MODTYPE    Frequency   Percent   Frequency    Percent
-----------------------------------------------------
Midsize           10      62.5          10       62.5
Sporty             4      25.0          14       87.5
Luxury             2      12.5          16      100.0
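
The order option also accepts the value data, which the examples above do not use. It lists values in the order in which they first appear in the data set. The following is a minimal sketch, assuming the auto data set and the typefmt format created above.

```sas
/* Sketch only: uses the auto data set and typefmt format from above. */
PROC FREQ DATA=auto ORDER=DATA;
  TABLES modtype;
  FORMAT modtype typefmt.;
RUN;
```

Because the first observation in auto has modtype 2, the third has modtype 3, and the fourth has modtype 1, this should list Midsize first, then Luxury, then Sporty.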

SAS FAQ How can I direct the output from PC SAS to a file?

When running SAS programs interactively through the display manager, the output from any procedure is written to the Output window and notes, warnings and errors are written to the Log Window. Contents of these windows are temporary. They can be saved to a file using the File Save pulldown menus from the Output Window and from the Log Window. But if you want to make sure that the output of these windows is saved to a file every time, you can use proc printto to automatically route output to a file.

For example, the following program routes the output from proc print directly to a file named auto.lst. What would have gone to the Output Window is redirected to the file c:\auto.lst .

PROC PRINTTO PRINT='c:\auto.lst' NEW;
RUN;

PROC PRINT DATA=auto;
  VAR make price ;
RUN;

PROC PRINTTO PRINT=PRINT;
RUN;

Let's look at each statement more closely.

The statements below tell SAS to send output that would go to the Output Window to the file c:\auto.lst and to create a new file if the file currently exists. If the new option was omitted, SAS would append to the file if it existed.

PROC PRINTTO PRINT='c:\auto.lst' NEW;
RUN;

This is just a regular proc print, the results of which will go to c:\auto.lst.

PROC PRINT DATA=auto;
  VAR make price ;
RUN;

The statement below tells SAS to direct any subsequent output to the default destination (in SAS Display Manager, output will be directed back to the Output Window).

PROC PRINTTO PRINT=PRINT;
RUN;

It is also possible to change the destination for the SAS Log, as shown in the example below. This example redirects the log to c:\auto.log . The log information from the proc print is directed to c:\auto.log, and then the destination for the log is returned to its default (in SAS Display Manager, the log will be directed back to the Log Window).

PROC PRINTTO LOG='c:\auto.log' NEW;
RUN;

PROC PRINT DATA=auto;
  VAR make price ;
RUN;

PROC PRINTTO LOG=LOG;
RUN;

Finally, you can change the destination for the SAS Output Window and Log Window at the same time, as illustrated below.

PROC PRINTTO PRINT='c:\auto.lst' LOG='c:\auto.log' NEW;
RUN;

PROC PRINT DATA=auto;
  VAR make price ;
RUN;

PROC PRINTTO PRINT=PRINT LOG=LOG ;
RUN;

While these examples have focused on SAS Display Manager for the PC, these same commands could be used in SAS Display Manager on any platform. In fact, this could be used in SAS in batch mode as well. For example, it can be convenient to use proc printto to direct the output of a group of statements to one file, then to direct the output from another group of statements to another file.
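
The idea of sending one group of statements to one file and another group to a second file can be sketched as follows. The two file names are hypothetical, and the sketch assumes the auto data set created in the proc freq example earlier in this document.

```sas
/* Sketch only: the two file names below are hypothetical. */
PROC PRINTTO PRINT='c:\auto_print.lst' NEW;
RUN;

PROC PRINT DATA=auto;
  VAR make price;
RUN;

/* Switch the destination for the next group of statements. */
PROC PRINTTO PRINT='c:\auto_freq.lst' NEW;
RUN;

PROC FREQ DATA=auto;
  TABLES modtype;
RUN;

/* Return output to the default destination. */
PROC PRINTTO PRINT=PRINT;
RUN;
```

Each proc printto step stays in effect until the next one, so the proc print listing goes to the first file and the proc freq table goes to the second.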

SAS FAQ How can I export my SAS results to an Excel spreadsheet?

SAS output is rarely the form in which results are presented. Many create results tables in Excel. This page will provide an example of how to generate a multi-tab spreadsheet containing SAS results. We will be using the Output Delivery System (ODS) to do so. ODS allows you to generate tabular output from your raw output that can be placed into Excel sheets. In the code below, we are creating an Excel file (giving it a name and location), indicating a style to be used ("minimal" in this example), and specifying a few other options.

ODS TAGSETS.EXCELXP
  file='D:\work\sas9\regression.xls'
  STYLE=minimal
  OPTIONS ( Orientation = 'landscape'
            FitToPage = 'yes'
            Pages_FitWidth = '1'
            Pages_FitHeight = '100' );

After this code, we can analyze our data and a new tab will be created for each separate ODS table from your output.

proc reg data = ats.hsb2;
  model write = female math read;
run;
quit;

proc format;
  value fm  1='Female' 0='Male';
  value ses 1='Low' 2='Middle' 3='High';
run;

After completing our analysis, we "close" the Excel file.

ods tagsets.excelxp close;

We can now open the Excel file and see the separate tabs for each ODS table.

The style indicated affects the appearance of the output we see in Excel. For example, we used the "minimal" style, and the ANOVA table in the second tab shows what this style looks like.

To see a list of the available styles, run the SAS code below:

ods listing;
proc template;
  list styles;
run;
quit;

We employed a few of the "options" to format our results in Excel. To see the full list of options, run the SAS code below:

filename temp temp;
ods tagsets.ExcelXP file=temp options(doc='help');
ods tagsets.ExcelXP close;

We can look at another style and some additional options. In the example code below, we create a two-way table of means in SAS and output the results to Excel with the "printer" style. We have also added a title to the output in Excel by including the embedded_titles option in our ODS options.

ODS TAGSETS.EXCELXP
  file='D:\work\sas9\tab2.xls'
  STYLE=Printer
  OPTIONS ( Orientation = 'landscape'
            FitToPage = 'yes'
            Pages_FitWidth = '1'
            Pages_FitHeight = '100'
            embedded_titles = 'yes' );

title 'Mean Write Scores by SES and Female';

proc format;
  value fm  1='Female' 0='Male';
  value ses 1='Low' 2='Middle' 3='High';
run;

proc tabulate data = ats.hsb2;
  class ses female;
  var write / style = {tagattr='format:000'};
  table ses=''*mean,
        write=''*[style={tagattr='format:#0.00'}]*female='' / rtspace=42;
  format female fm. ses ses.;
run;

ods tagsets.excelxp close;

We can view these results to see how the "printer" style appears in Excel:

For further details on sending results from SAS to Excel, see Vincent DelGobbo's paper Creating Multi-Sheet Excel Workbooks the Easy Way with SAS.
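
Since each ODS table lands on its own tab, it can help to give the tabs readable names. The sketch below uses the sheet_name option, which should appear in the option list printed by the doc='help' call shown earlier; if it is not listed for your version of the tagset, treat this as an assumption. The output file name and the sheet names are hypothetical.

```sas
/* Sketch only: the file path and sheet names are hypothetical. */
ods tagsets.excelxp file='D:\work\sas9\named_tabs.xls' style=minimal
    options(sheet_name='Frequencies');

proc freq data = ats.hsb2;
  tables ses;
run;

/* Change the name to be used for the next worksheet. */
ods tagsets.excelxp options(sheet_name='Means');

proc means data = ats.hsb2;
  var write;
run;

ods tagsets.excelxp close;
```

Issuing a second ods tagsets.excelxp statement with only options( ) changes settings for the output that follows without closing or reopening the file.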

SAS FAQ How can I send SAS data/results to specific cells in an Excel spreadsheet?

SAS output is rarely the form in which results are presented. Many create results tables in Excel. This page will provide an example of how to send data or results generated in SAS to specific cell locations in an Excel worksheet. We will be using the Dynamic Data Exchange (DDE) method in SAS to do so. DDE allows you to move information from SAS to Windows applications. We can start with a small example. We can open a new Excel sheet and send three variable names to the first row and then generate three variables to put in the cells below the names.

To open a new Excel sheet from SAS, we use the x command followed by the full path to the Excel program (the .exe file). We have also specified the noxwait and noxsync options. The first allows you to use the x command, which runs outside programs, without typing "exit" before returning to SAS. The second lets SAS continue without waiting for the outside program to finish, so control returns to SAS while Excel stays open.

options noxwait noxsync;
x '"C:\Program Files\Microsoft Office\OFFICE11\excel.exe"';

Running the above two lines (with the appropriate pathname to Excel) will open a blank Excel spreadsheet. Next, we will indicate with a filename the sheet and cells of the open sheet that we will write to from SAS.

filename example1 dde 'excel|sheet1!r1c1:r1c3';

Above, we are creating a filename example1 that will write to the sheet and cells indicated: sheet1 (the default sheet name in a new Excel file), from the first cell in the first row ("r1c1") to the third cell in the first row ("r1c3"). Next, we can run the data step below. It includes a file statement referring to the specified location in the open Excel file. We create three variables, x, y and z, with one value each. Then we put the three variables into Excel.

data _null_;
  file example1;
  x = "x";
  y = "y";
  z = "z";
  put x y z;
run;

We can see that these character values have been put into the first three cells of the first row in Excel.
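
The put statement above separates values with blanks, and by default DDE treats a blank as a move to the next cell. That works for single-word values, but a value containing a blank would be spread over several cells. One common workaround, sketched here, is to add the notab option to the filename statement and write explicit tab characters ('09'x) between values. The fileref example1b and the text values are made up for illustration.

```sas
/* Sketch only: the fileref and values are hypothetical.
   Excel must already be open with a sheet named sheet1. */
filename example1b dde 'excel|sheet1!r1c1:r1c3' notab;

data _null_;
  file example1b;
  x = "first name";
  y = "last name";
  z = "test score";
  /* With notab, blanks are kept inside a cell; the tab
     character '09'x moves to the next cell. */
  put x '09'x y '09'x z;
run;
```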

Next, we can indicate the next block of cells that we wish to write to.

filename example1 dde 'excel|sheet1!r2c1:r101c3';

We will create a variable with 100 values drawn with the ranuni function, and then create two more variables based on that variable. We will again use file and put statements.

data _null_;
  file example1;
  do i=1 to 100;
    x=ranuni(i);
    y=10+x;
    z=x-10;
    put x y z;
  end;
run;

We can see that the Excel sheet now contains these values.

This use of DDE can be very useful if you have multiple data sets for which you want to do the same analysis and present results in a consistent template in Excel. We created a small results template in Excel for presenting the means of three groups in a dataset. The mean values have been left blank and we saved the spreadsheet as DDE_template.xls and named the template sheet means.

Next, we can look at an example dataset with three groups and use proc means to calculate the group means.

data d1;
  input group score;
cards;
1 7.960000038
1 7.590000153
1 7.849999905
1 8
1 7.75
2 8.050000191
2 8.069999695
2 7.78000021
2 7.619999886
2 8.170000076
3 8.090000153
3 8.229999542
3 8
3 8.090000153
3 7.880000114
;

proc sort data = d1;
  by group;
run;

proc means noprint data = d1;
  by group;
  var score;
  output out = d1_means;
run;

Now, we can open our template spreadsheet and use a filename statement to indicate the three empty cells in our template.

options noxwait noxsync;
x '"D:\Data\DDE_template.xls"';
filename example2 dde 'excel|means!r2c3:r4c3';

This will open our template file. Next, we can use file and put statements for the three statistics of interest.

data d1_means;
  set d1_means;
  file example2;
  if _stat_ = "MEAN" then put score;
run;

We can now look at our filled-in template. Our results have been formatted according to the formats for the cells to which they are exported.

SAS FAQ How can I include cells with zero counts in proc freq with the list option?

It is not uncommon for a cross tabulation of two variables to produce cells with zero counts. In these cases, the output for proc freq with the list option will omit combinations of variable values that have zero counts. For example, in the sample data below, there are no cases where gender = 2 and eth = 2. The dataset below contains three variables. Two of them, gender and eth, are the categorical variables we want to cross tabulate. The third variable is count, a frequency weight indicating the number of cases with that pattern in the dataset; for example, there are 21 cases where gender=1 and eth=3.

data test;
  input gender eth count;
datalines;
1 1 12
1 2 12
1 3 21
2 1 2
2 3 43
;
run;

When we run proc freq without the list option, the table includes the cell with a zero count (i.e. gender=2 and eth=2). The weight statement specifies that the variable count should be used to determine the number of cases with each pattern in the dataset.

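proc freq data = test;
  weight count;
  table gender*eth;
run;

The FREQ Procedure

Table of gender by eth

gender     eth

Frequency|
Percent  |
Row Pct  |
Col Pct  |       1|       2|       3|  Total
---------+--------+--------+--------+
        1|     12 |     12 |     21 |     45
         |  13.33 |  13.33 |  23.33 |  50.00
         |  26.67 |  26.67 |  46.67 |
         |  85.71 | 100.00 |  32.81 |
---------+--------+--------+--------+
        2|      2 |      0 |     43 |     45
         |   2.22 |   0.00 |  47.78 |  50.00
         |   4.44 |   0.00 |  95.56 |
         |  14.29 |   0.00 |  67.19 |
---------+--------+--------+--------+
Total          14       12       64       90
            15.56    13.33    71.11   100.00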

Running proc freq with the list option as shown below, however, produces a table without a line for gender=2 and eth=2, because the count for this combination is zero.

proc freq data = test;
  weight count;
  table gender*eth /missprint list;
run;

                                     Cumulative Cumulative
gender    eth   Frequency   Percent   Frequency    Percent

     1      1          12     13.33          12      13.33
     1      2          12     13.33          24      26.67
     1      3          21     23.33          45      50.00
     2      1           2      2.22          47      52.22
     2      3          43     47.78          90     100.00

It is possible, with a little work, to get a table like the one above that does contain the missing cell. The first step is to create a variable that is a constant equal to one. We do this below with the variable called one.

data test;
  set test;
  one = 1;
run;

Next we will use ods to capture the output from running proc means. The statement ods output summary = t instructs SAS to store the summary table produced by proc means in a dataset called t. In the proc means statement, the n option specifies that only the n (i.e. the count of cases) should be computed. The completetypes option specifies that all combinations of the class variables (gender and eth) should be listed, including those with zero counts in the dataset; otherwise they would be omitted. The freq statement specifies that the variable count should be used as the number of cases that follow a given pattern in the dataset (i.e. in this example, freq does in proc means what weight did in proc freq). The var statement indicates the variable about which statistics should be calculated; this is where the variable one is used.

ods output summary = t;
proc means data = test n completetypes;
  class gender eth;
  freq count;
  var one;
run;

The MEANS Procedure

Analysis Variable : one

gender    eth    N Obs     N

     1      1       12    12
            2       12    12
            3       21    21
     2      1        2     2
            2        0     0
            3       43    43

This creates a dataset t (shown below) with six observations, one for each combination of eth and gender, both of which appear as variables in our new dataset. The new dataset also contains a variable called one_N, which is the number of cases that fall into each of the categories of gender by eth (including a count of 0 where appropriate).

proc print data=t;
run;

Obs    gender    eth    NObs    one_N

  1       1       1      12       12
  2       1       2      12       12
  3       1       3      21       21
  4       2       1       2        2
  5       2       2       0        0
  6       2       3      43       43

Now we can run proc freq on the new dataset t. The weight statement uses one_N as the variable containing the count of cases with a given pattern of variables. The /zeros option instructs SAS to include in the list cases in which one_N = 0. The resulting list includes a row for the combination of gender and eth with a zero count.

proc freq data = t;
  table gender*eth /list;
  weight one_N /zeros;
run;

The FREQ Procedure

                                     Cumulative Cumulative
gender    eth   Frequency   Percent   Frequency    Percent

     1      1          12     13.33          12      13.33
     1      2          12     13.33          24      26.67
     1      3          21     23.33          45      50.00
     2      1           2      2.22          47      52.22
     2      2           0      0.00          47      52.22
     2      3          43     47.78          90     100.00
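
An alternative worth trying, not used in the steps above, is the sparse option on the table statement. It asks proc freq to list every combination of the variable levels when the list option is in effect, including combinations that do not occur in the data. This is a minimal sketch that assumes the original test dataset; check the proc freq documentation for your release to confirm how the option behaves there.

```sas
/* Sketch only: a different technique from the proc means approach above. */
proc freq data = test;
  weight count;
  table gender*eth / list sparse;
run;
```

If the option behaves as described, the list should include a row for gender=2 and eth=2 with a frequency of zero, without the intermediate dataset t.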
