Post on 17-Jan-2018
Chapter 18: Reading Free-Format Data
Objectives
• Read free-format data that is not organized in fixed fields.
• Read free-format data separated by non-blank delimiters, such as commas.
• Read a raw data file with missing data (at the end, middle, or beginning of a record).
• Read character values exceeding 8 characters.
• Read nonstandard free-format data.
• Read character values containing embedded blanks.
What Is Free-Format Data?
• The data values are not arranged in fixed fields.
• Data values are separated by blanks or by some other specific delimiter.
• Numeric data values may not be in standard format.
Issues that need special attention when reading free-format data:
• How should missing data in a free-format data set be handled?
• The danger of an incorrect variable length.
• How should data values with quotation marks be handled?
• Informats do not behave the same way when reading free-format data as they do in formatted input.
List Input with the Default Delimiter (Blank Is the Default Delimiter)
The data is not in fixed columns. The fields are separated by spaces. There is one nonstandard field.
50001 4feb1989 132 530
50002 11nov1989 152 540
50003 22oct1991 90 530
50004 4feb1993 172 550
50005 24jun1993 170 510
50006 20dec1994 180 520
List Input and Its Variations
The simplest way to read free-format data is list input.
General syntax:
INPUT variable <$> ;
• variable is the name of the variable to be read.
• $ specifies a character variable.
NOTE:
• The list input style signals to SAS that fields are separated by delimiters.
• SAS then reads from non-delimiter to delimiter instead of from a specific location on the raw data record.
Important conditions for list input:
• All fields must be separated by at least one blank.
• Fields must be read sequentially from left to right.
• You cannot skip or re-read fields.
• Missing data for a character variable must be represented by a user-defined missing value (a blank cannot be used, because blank is the delimiter).
• Missing data for a numeric variable must be represented by a period (.) or another user-defined missing value (a blank cannot be used for a numeric missing value).
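As a minimal sketch of the conditions above, the following DATA step reads blank-delimited values with list input and uses a period as the placeholder for a missing numeric value. The data set name flights and the in-stream values are hypothetical, chosen only for illustration.

```sas
/* Hypothetical example: list input with the default (blank) delimiter. */
data flights;
   /* $ marks ID as character; PassCap and CargoCap are numeric. */
   input ID $ PassCap CargoCap;
   datalines;
50001 132 530
50002 . 540
50003 90 530
;
run;

proc print data=flights;
run;
```

The period in the second record holds the place of PassCap, so CargoCap should still receive 540 rather than shifting one field to the left.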
Delimiters
Common delimiters are
• blanks
• commas
• tab characters
A space (blank) is the default delimiter.
Input Data Involving Dates and Times
• The second field is a date. How does SAS store dates? SAS stores a date as the number of days since January 1, 1960, so 4feb1989 is stored as 10627.
50001 4feb1989 132 530
50002 11nov1989 152 540
50003 22oct1991 90 530
50004 4feb1993 172 550
50005 24jun1993 170 510
50006 20dec1994 180 520
Standard Data
• The term standard data refers to character and numeric data that SAS recognizes automatically.
• Some examples of standard numeric data include
– 35469.93
– 3E5 (exponential notation)
– -46859
• Standard character data is any character you can type on your keyboard. Standard character values are always left-justified by SAS.
Nonstandard Data
• The term nonstandard data refers to character and numeric data that SAS does not recognize automatically.
• Examples of nonstandard numeric data include
– 12/12/2012
– 29FEB2000
– 4,242
– $89,000
Informats
• Informats are instructions that specify how SAS reads raw data.
• To read nonstandard data, you must apply an informat.
• General form of an informat:
<$>INFORMAT-NAME<w>.<d>
Examples of informats:
• COMMAw. reads numeric data such as $4,242 and strips out selected nonnumeric characters, such as dollar signs, commas, dashes, and blanks.
• MMDDYYw. reads dates in the form 12/31/2012.
• DATEw. reads dates in the form 29Feb2000.
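To see what these informats do, one option is to apply them to text values with the INPUT function and write the results to the log. This is only a sketch: the INPUT function is not part of the slides, and the literal values and variable names are hypothetical.

```sas
/* Sketch: apply informats to hypothetical text values with the INPUT function. */
data _null_;
   amount = input('$4,242', comma6.);        /* strips the dollar sign and comma */
   date1  = input('12/31/2012', mmddyy10.);  /* stored as a SAS date value */
   date2  = input('29Feb2000', date9.);      /* stored as a SAS date value */
   /* Write the results to the log; DATE9. is used here as a display format. */
   put amount= date1= date9. date2= date9.;
run;
```

The informat controls how the text is converted on the way in, while the DATE9. format in the PUT statement only controls how the stored day count is displayed.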
Reading Free-Format Data with Delimiters
• By default, free-format data values are separated by blanks. SAS reads a data value until it reaches the next blank.
• Blank is not the only possible delimiter. SAS allows user-specified delimiters, as long as the delimiter is not part of any data value. For example, / , % ; and so on can be used as delimiters when creating an external free-format data set.
• The DLM= option in the INFILE statement tells the INPUT statement which delimiters are used.
Ex:
INFILE 'path-to-the-file' DLM=',';
tells the INPUT statement to read each data value until a comma (,) is reached.
Example
The following is an airplane data set consisting of ID, date in service, passenger capacity, and cargo capacity. The data values are separated by commas and spaces. How does SAS read this data set?
LA50001,4feb1989,132, 530
PHIL50002, 11nov1989, 152 ,540
NEWYORK50003 ,22oct1991, 90, 530
CHICAGO50004, 4feb1993 ,172 ,550
DETROIT50005 ,24jun1993, 170 ,510
DALLAS50006, 20dec1994, 180, 520
Reading a Delimited Raw Data File
data airplanes;
   infile 'raw-data-file' dlm=', ';
   input ID $ InService : date9. PassCap CargoCap;
run;
The colon (:) modifier, covered later in this chapter, makes SAS read InService from delimiter to delimiter, so a blank after a comma does not shift the date.
Exercise
• Write a SAS program to read the following data. The variables are Location, date, number of passengers, and amount of cargo for the flight.
LA50001,4feb1989,132, 530
PHIL50002,11nov1989, 152 ,540
NEWYORK50003,22oct1991 , 90, 530
CHICAGO50004,4feb1993 , 172 ,550
DETROIT50005,24jun1993, 170 ,510
DALLAS50006,20dec1994 , 180, 520
• Print the data. Save the program as c18_freeform1 in the SASEx folder on your C drive.
• Observe the results. You should notice that some data values for Location are not complete.
• What causes the incomplete data values?
• How can this problem be solved?
Answer
data airplane;
   infile datalines dlm=', ';
   input Loc $ date date9. npas ncargo;
   datalines;
LA50001,4feb1989,132, 530
PHIL50002,11nov1989, 152 ,540
NEWYORK50003,22oct1991 , 90, 530
CHICAGO50004,4feb1993 , 172 ,550
DETROIT50005,24jun1993, 170 ,510
DALLAS50006,20dec1994 , 180, 520
;
run;
proc print;
   format date date9.;
run;
Results
Obs  Loc       date       npas  ncargo
1    LA50001   04FEB1989  132   530
2    PHIL5000  11NOV1989  152   540
3    NEWYORK5  22OCT1991   90   530
4    CHICAGO5  04FEB1993  172   550
5    DETROIT5  24JUN1993  170   510
6    DALLAS50  20DEC1994  180   520
What is wrong with this result?
NOTE: Some of the Loc values are incomplete.
NOTE: The default length is 8 characters, but some of the Loc values are longer than 8 characters.
Lengths of Variables read using free-format
• When you use list input, the default length for character and numeric variables is 8 bytes.
• You can set the length of character variables with a LENGTH statement or with an informat.
• General form of a LENGTH statement:
LENGTH variable-name <$> length-specification ...;
Setting the Length of a Variable
data airplanes;
   length ID $ 15;
   infile 'raw-data-file' dlm=', ';
   input ID $ InService : date9. PassCap CargoCap;
run;
Exercise
Open the program c18_freeform1 and revise it so that the data values for Location are complete.
Answer
data airplane;
   length Loc $ 15;
   infile datalines dlm=', ';
   input Loc $ date date9. npas ncargo;
   datalines;
LA50001,4feb1989,132, 530
PHIL50002,11nov1989, 152 ,540
NEWYORK50003,22oct1991 , 90, 530
CHICAGO50004,4feb1993 , 172 ,550
DETROIT50005,24jun1993, 170 ,510
DALLAS50006,20dec1994 , 180, 520
;
run;
proc print;
   format date date9.;
run;
Correct Results
Obs  Loc           date       npas  ncargo
1    LA50001       04FEB1989  132   530
2    PHIL50002     11NOV1989  152   540
3    NEWYORK50003  22OCT1991   90   530
4    CHICAGO50004  04FEB1993  172   550
5    DETROIT50005  24JUN1993  170   510
6    DALLAS50006   20DEC1994  180   520
data airplanes;
   length ID $ 5;
   infile 'raw-data-file';
   input ID $ InService date9. PassCap CargoCap;
run;
Raw Data File
50001 4feb1989 132 530
50002 11nov1989 152 540
50003 22oct1991 90 530
50004 4feb1993 172 550
50005 24jun1993 170 510
50006 20dec1994 180 520
Compile
• The LENGTH statement places ID in the PDV as a character variable of length 5 (ID $ 5).
• The INPUT statement adds InService, PassCap, and CargoCap to the PDV as numeric variables of length 8 (N 8).
Execute
1. The numeric variables in the PDV are initialized to missing (.).
2. SAS loads the first record into the input buffer: 50001 4feb1989 132 530
3. The INPUT statement moves the values into the PDV: ID=50001, InService=10627, PassCap=132, CargoCap=530.
4. Implicit output: SAS writes the observation to airplanes.
5. Implicit return: SAS returns to the top of the DATA step and processes the next record.
Using the DLM= Option in the INFILE Statement
• The DLM= option sets a character or characters that SAS recognizes as a delimiter in the raw data file.
• General form of the INFILE statement with the DLM= option:
INFILE 'raw-data-file' DLM='delimiter(s)';
• Any character you can type on your keyboard can be a delimiter. You can also use hexadecimal characters.
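As an illustration of a hexadecimal delimiter, the sketch below writes a small tab-delimited file and reads it back with DLM='09'x. The fileref tabfile, the data set name capacity, and the values are hypothetical, and '09'x is the tab character only on ASCII systems.

```sas
/* Sketch: write a small tab-delimited file, then read it back. */
filename tabfile temp;   /* temporary file; the fileref name is hypothetical */

data _null_;
   file tabfile;
   /* '09'x is the hexadecimal constant for a tab on ASCII systems. */
   put '50001' '09'x '132' '09'x '530';
   put '50002' '09'x '152' '09'x '540';
run;

data capacity;
   /* DLM= accepts a hexadecimal constant as the delimiter. */
   infile tabfile dlm='09'x;
   input ID $ PassCap CargoCap;
run;

proc print data=capacity;
run;
```

Writing the file from a DATA step avoids depending on how an editor stores tab characters in in-stream data.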
Reading Missing Values
Two situations may occur when reading free-format data that contains missing values:
• Missing values at the END of a record
• Missing values at the BEGINNING or MIDDLE of a record
Missing Data at the End of a Record
50001 , 4feb1989,132
50002, 11nov1989,152, 540
50003, 22oct1991,90, 530
50004, 4feb1993,172
50005, 24jun1993, 170, 510
50006, 20dec1994, 180, 520
• By default, when data is missing at the end of a record, SAS looks for the missing value in the next record:
1. SAS loads the next record to finish the observation.
2. A note is written to the log.
3. SAS loads a new record at the top of the DATA step and continues processing.
data airplanes3;
   length ID $ 5;
   infile 'raw-data-file' dlm=',';
   input ID $ InService : date9. PassCap CargoCap;
run;
Execute
1. The numeric variables in the PDV are initialized to missing (.).
2. SAS loads the first record into the input buffer: 50001 , 4feb1989,132
3. The INPUT statement assigns ID=50001, InService=10627, and PassCap=132. There is no data left in the record for CargoCap.
4. SAS loads the next record into the input buffer: 50002, 11nov1989,152, 540
5. The first field of that record is assigned to CargoCap, so CargoCap=50002.
6. Implicit output: SAS writes the observation (50001, 10627, 132, 50002) to airplanes3.
7. Implicit return: SAS returns to the top of the DATA step and loads the following record: 50003, 22oct1991,90, 530
8. Processing continues until the end of the raw data file.
Partial Log
NOTE: 6 records were read from the infile 'aircraft3.dat'.
      The minimum record length was 19.
      The maximum record length was 26.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.AIRPLANES3 has 4 observations and 4 variables.
Missing Data at the End of the Row
proc print data=airplanes3 noobs;
run;
PROC PRINT Output
ID     InService  PassCap  CargoCap
50001  10627      132      50002
50003  11617       90        530
50004  12088      172      50005
50006  12772      180        520
Use the MISSOVER Option in the INFILE Statement to Handle Missing Values at the End of a Record
• The MISSOVER option prevents SAS from loading a new record when the end of the current record is reached.
• General form of the INFILE statement with the MISSOVER option:
INFILE 'raw-data-file' MISSOVER;
• If SAS reaches the end of the row without finding values for all fields, variables without values are set to missing.
Using the MISSOVER Option
data airplanes3;
   length ID $ 5;
   infile 'raw-data-file' dlm=',' missover;
   input ID $ InService : date9. PassCap CargoCap;
run;
Partial SAS Log
NOTE: 6 records were read from the infile 'aircraft3.dat'.
      The minimum record length was 19.
      The maximum record length was 26.
NOTE: The data set WORK.AIRPLANES3 has 6 observations and 4 variables.
proc print data=airplanes3 noobs;
run;
PROC PRINT Output
ID     InService  PassCap  CargoCap
50001  10627      132        .
50002  10907      152      540
50003  11617       90      530
50004  12088      172        .
50005  12228      170      510
50006  12772      180      520
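The same behavior can be tried without an external file. The following sketch uses in-stream data; the data set name airplanes_demo and the shortened records are hypothetical, written without stray blanks so that only the MISSOVER effect is on display.

```sas
/* Sketch: MISSOVER with in-stream data (hypothetical records). */
data airplanes_demo;
   length ID $ 5;
   /* MISSOVER keeps SAS on the current record when fields run out. */
   infile datalines dlm=',' missover;
   input ID $ InService : date9. PassCap CargoCap;
   datalines;
50001,4feb1989,132
50002,11nov1989,152,540
50003,22oct1991,90,530
;
run;

proc print data=airplanes_demo;
   format InService date9.;
run;
```

Removing MISSOVER from the INFILE statement and rerunning the step is a quick way to compare it with the default behavior described earlier.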
Missing Values at the Beginning or Middle of a Record
Missing values can also occur at the beginning or in the middle of a record.
Because consecutive delimiters, such as ,, are treated as a single delimiter, DLM=',' alone cannot handle these situations.
Missing Values without Placeholders
• In the following data, a missing value is represented by two consecutive delimiters.
50001 , 4feb1989,, 530
50002, 11nov1989,132, 540
50003, 22oct1991,90, 530
50004, 4feb1993,172, 550
50005, 24jun1993,, 510
50006, 20dec1994, 180, 520
• By default, SAS treats two consecutive delimiters as one. A missing value should therefore be represented by a placeholder, such as a period (.) for a numeric missing value:
50001 , 4feb1989, . , 530
• A blank cannot be used as the placeholder for a missing character value. Using a placeholder for a character variable means choosing a string to stand for missing and then writing SAS code to convert that string into a missing value.
• Alternatively, the DSD option in the INFILE statement handles these missing cases.
Reading the data without the DSD option:
data airplanes4;
   length ID $ 5;
   infile 'raw-data-file' dlm=',';
   input ID $ InService : date9. PassCap CargoCap;
run;
Execute
1. The numeric variables in the PDV are initialized to missing (.).
2. SAS loads the first record into the input buffer: 50001 , 4feb1989,, 530
3. The two consecutive commas are treated as one delimiter, so the INPUT statement assigns ID=50001, InService=10627, and PassCap=530. There is no data left in the record for CargoCap.
4. SAS loads the next record into the input buffer: 50002, 11nov1989,132, 540
5. The first field of that record is assigned to CargoCap, so CargoCap=50002.
6. Implicit output: SAS writes the observation (50001, 10627, 530, 50002) to airplanes4.
7. Implicit return: SAS returns to the top of the DATA step and loads the following record: 50003, 22oct1991,90, 530
Missing Values without Placeholders
Partial Log
NOTE: 6 records were read from the infile 'aircraft4.dat'.
      The minimum record length was 21.
      The maximum record length was 26.
NOTE: SAS went to a new line when INPUT statement reached past the end of a line.
NOTE: The data set WORK.AIRPLANES4 has 4 observations and 4 variables.
The missing values are not read correctly.
proc print data=airplanes4 noobs;
run;
PROC PRINT Output
ID     InService  PassCap  CargoCap
50001  10627      530      50002
50003  11617       90        530
50004  12088      172        550
50005  12228      510      50006
This is not correct. Not only are the missing values read incorrectly, but additional errors have occurred.
Missing Values without Placeholders
• If your data does not have placeholders, as in the record
50001 , 4feb1989,, 530
use the DSD option.
The DSD Option
• General form of the DSD option in the INFILE statement:
INFILE 'file-name' DSD;
• The DSD option
– sets the default delimiter to a comma
– treats consecutive delimiters as missing values
– enables SAS to read values with embedded delimiters if the value is surrounded by double quotes.
Using the DSD Option
data airplanes4;
   length ID $ 5;
   infile 'raw-data-file' dsd;
   input ID $ InService : date9. PassCap CargoCap;
run;
Partial Log
NOTE: 6 records were read from the infile 'aircraft4.dat'.
      The minimum record length was 22.
      The maximum record length was 25.
NOTE: The data set WORK.AIRPLANES4 has 6 observations and 4 variables.
proc print data=airplanes4 noobs;
run;
PROC PRINT Output
ID     InService  PassCap  CargoCap
50001  10627        .      530
50002  10907      132      540
50003  11617       90      530
50004  12088      172      550
50005  12228        .      510
50006  12772      180      520
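A small in-stream version of the DSD case can be run as follows. This is a sketch with hypothetical records and a hypothetical data set name (airplanes_dsd); the first record has two consecutive commas where PassCap would be.

```sas
/* Sketch: DSD treats consecutive commas as a missing value. */
data airplanes_dsd;
   length ID $ 5;
   /* DSD makes the comma the default delimiter, so DLM= is not needed here. */
   infile datalines dsd;
   input ID $ InService : date9. PassCap CargoCap;
   datalines;
50001,4feb1989,,530
50002,11nov1989,132,540
;
run;

proc print data=airplanes_dsd;
   format InService date9.;
run;
```

With DSD in effect, the empty field in the first record should come through as a missing PassCap instead of pulling 530 into its place.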
Exercise
Open the program c18_freeformat_missing. Run the program and observe the problem. Revise the program so that the missing data are properly handled.
Answer
data carsales;
   infile datalines dlm=',' missover dsd;
   input year country $ type $ sales;
   datalines;
1998,US,CARS, 194324.12
1998,US,TRUCKS,142290.30
1998, CANADA,CARS,10483.44
1998, CANADA,TRUCKS,
1998,JAPAN,CARS,15066.43
1998,JAPAN, TRUCKS ,40700.34
1997 ,,CARS , 213504.05
1997,US,TRUCKS,116735.65
1997,CANADA,CARS,904.89
1997,CANADA,TRUCKS,76576.12
1997,JAPAN,CARS,10000.18
1997,JAPAN,TRUCKS,50458.22
;
proc print data=carsales;
run;
Exercise
Open c18_freeformat2. Run the program, observe the results, and revise the program to read the data correctly.
Answer
data carsales2;
   length type $ 14;
   infile datalines dlm='/' missover dsd;
   input year (country type) ($) sales comma10.;
   datalines;
1998/US/CARS/$194324.12
1998/US/TRUCKS_GM/ $142290.30
1998/CANADA/CARS/$10483.44
1998/CANADA/TRUCKS_FORD/
1998/JAPAN/CARS/$15066.43
1998/JAPAN/'TRUCKS_HUNDA'/$40700.34
1997/US/CARS/$213504.05
1997//TRUCKS_FORD/ $116735.65
1997/CANADA/CARS/$904.89
1997/CANADA/TRUCKS_GM/$76576.12
/JAPAN/CARS/$10000.18
1997/JAPAN/TRUCKS_TOYOTA/$50458.22
;
proc print data=carsales2;
   title ' / as delimiter ';
run;
proc contents;
run;
Specifying an Informat
To specify an informat in list input, use the colon (:) format modifier in the INPUT statement between the variable name and the informat.
General form of a format modifier in an INPUT statement:
INPUT variable : informat;
NOTE: An informat does not behave the same way in free-format (list) input as in fixed-format (formatted) input:
• In formatted input, the informat specifies how many columns to read from the raw data record and how the values in those columns are represented.
• In list input with the colon modifier, SAS reads from delimiter to delimiter and uses the informat to interpret the value. For a character variable, the informat width also sets the length of the variable in the new data set.
Modifying List Input
When reading free-format data, it is difficult to specify an informat by the number of columns to read, because the values do not line up in fixed columns. Nonstandard data values also cannot be read properly by simple list input.
SAS provides two modifiers that help apply an informat in list input.
Modifiers Used in List Input
• The ampersand (&) modifier is used to read character values that contain embedded blanks.
• The colon (:) modifier is used to read nonstandard data values and character values that are longer than 8 characters but contain no embedded blanks.
Using the & Modifier in List Input
• The & modifier lets SAS read a character value that contains single embedded blanks, such as NEW YORK. With the default blank delimiter, NEW YORK would be read as two character values, NEW and YORK, but we want to read it as one data value.
• With &, SAS reads NEW YORK as one value. To keep SAS from reading into the next data value, NEW YORK must be followed by TWO or MORE blanks.
• In short, & reads a value with single embedded blanks until it reaches two or more consecutive blanks.
Example of Applying the & Modifier
Data set (City, Population); each city name is followed by two blanks:
NEW YORK  7,262,700
LOS ANGLES  3,259,340
CHICAGO  3,009,530
HOUSTON  1,728,910
To read this data set:
Data city_pop;
   Input city $ & population comma10.;
   Datalines;
NEW YORK  7,262,700
LOS ANGLES  3,259,340
CHICAGO  3,009,530
HOUSTON  1,728,910
;
run;
proc print;
run;
The results from the previous program using the & modifier
The SAS System 13:23 Monday, November 15, 2010 28
Obs  city      population
1    NEW YORK  7262700
2    LOS ANGL  3259340
3    CHICAGO   3009530
4    HOUSTON   1728910
NOTE: The data value LOS ANGLES is not read correctly. It is stored with the default length of 8, not the length of 10 that this value needs.
To handle this problem, we can use the LENGTH statement introduced previously:
LENGTH city $ 10;
SAS offers another way to do this: use the & modifier together with an informat.
Using the & Modifier with an Informat
Data city_pop;
   Input city & $10. population comma10.;
   Datalines;
NEW YORK  7,262,700
LOS ANGLES  3,259,340
CHICAGO  3,009,530
HOUSTON  1,728,910
;
run;
proc print;
run;
NOTE: Once $10. is used in list input, the LENGTH statement is no longer needed, because the informat defines the length for storing city.
Some Cautions When Using &
• NOTE: When used with &, $10. does not specify the number of columns to be read for city. It specifies the length used to store the value of city.
• You MUST use two consecutive blanks as the delimiter when you use the & modifier.
• You cannot use any other delimiter to mark the end of such a value.
Exercise
Open the program c18_freeformat_modifier. Run each program to learn how the modifiers work, and review the MISSOVER and DSD options and the LENGTH statement.
Reading Nonstandard Values in List Input
• Informats for nonstandard values, such as DATEw., TIMEw., DATETIMEw., COMMAw.d, and so on, require the user to specify the width w. When such an informat is used in formatted input, w defines the number of columns to be read from the data.
• In list input, however, the data is free-format, so it is often very difficult to guarantee that nonstandard values occupy a known number of columns.
• SAS provides a list input modifier, the colon (:), that allows nonstandard values to be read from one delimiter to the next.
List Input without the Colon
• The colon signals that SAS should read from delimiter to delimiter.
• If the colon is omitted, SAS reads the number of columns given by the informat width, which may cause it to read past the end of the field.
– No error message is printed.
– You might see invalid data messages or unexpected data values.
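The difference can be explored with a pair of DATA steps that read the same in-stream records with and without the colon. This is a sketch with hypothetical variable names and values; MISSOVER is added to the second step only so that any field that gets swallowed shows up as missing rather than being taken from the next record.

```sas
/* Sketch: the same hypothetical records read with and without the colon. */
data with_colon;
   /* The colon makes COMMA10. read from delimiter to delimiter. */
   input Code $ Amount : comma10. Units;
   datalines;
A 1,234 5
B 20,500 7
;
run;

data without_colon;
   infile datalines missover;
   /* Without the colon, COMMA10. reads a fixed 10 columns. */
   input Code $ Amount comma10. Units;
   datalines;
A 1,234 5
B 20,500 7
;
run;

proc print data=with_colon;
run;

proc print data=without_colon;
run;
```

Comparing the two listings shows whether Amount has run past the end of its field in the second data set.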
Using the Colon (:) Modifier in List Input
• The colon (:) modifier lets you read nonstandard data values, and
• read character values that are longer than 8 characters and contain no embedded blanks.
• SAS reads the value until a blank (or other delimiter) is reached.
• If the informat $w. is specified, its width overrides the default length.
Example of using the colon (:) modifier
Data city_pop;
   Input city & $10. population : comma.;
   Datalines;
NEW YORK  7,262,700
LOS ANGLES  3,259,340
CHICAGO  3,009,530
HOUSTON  1,728,910
;
run;
proc print;
run;
• NOTE: The informat COMMA. does not specify a w value. List input reads the data value until the next delimiter is reached. A numeric value is stored with the default length of 8, so there is no need to specify the length of a numeric variable.
• NOTE: If we DO NOT use the colon (:), we must specify COMMAw.d in order to read the correct number of columns from the data. In this situation, w is the number of columns read from the raw data.
INFILE Statement Options
Problem                                                Option
Non-blank delimiters                                   DLM='delimiter(s)'
Missing data at end of row                             MISSOVER
Missing data represented by consecutive delimiters,    DSD
and/or embedded delimiters where values are
surrounded by double quotes
These options can be used separately or together in the INFILE statement.
Creating Free-Format External Data
Just as we read free-format external data, we can also create free-format external data by using:
FILE 'path-to-external-data-set' <DLM='delimiters' DSD>;
PUT variable <format>;
The format specifies how the data values are written. This is particularly useful when writing data values in a nonstandard format such as COMMAw.d, DATE9., MMDDYY10., and so on. (MISSOVER is an INFILE statement option and does not apply to the FILE statement.)
An example that creates the city_pop.dat data
Data city;
   Input city & $10. population : comma.;
   Datalines;
NEW YORK  7,262,700
LOS ANGLES  3,259,340
CHICAGO  3,009,530
HOUSTON  1,728,910
;
run;
proc print;
run;
Data citypop;
   set city;
   File 'c:\math707\rawdata\city_pop.dat' dlm='/';
   Put city population : comma10.;
Run;
An example of creating external free-format data when the delimiter is a comma and a numeric variable is also written with the COMMAw.d format
Data citypop;
   set city;
   File 'c:\math707\rawdata\city_pop.dat' dsd;
   Put city population : comma10.;
Run;
NOTE: Because the delimiter is a comma and population is written with the COMMA format, each population value must be marked so that it is recognizable as a single data value. The DSD option in the FILE statement encloses such values in quotation marks.
When reading this type of data, one must also use the DSD option in the INFILE statement and be careful about the LENGTH.
The resulting data set
NEW YORK," 7,262,700"
LOS ANGLES," 3,259,340"
CHICAGO," 3,009,530"
HOUSTON," 1,728,910"
To read this data set, one needs to use the DSD option in the INFILE statement.
Data citypop2;
   length city $ 10;
   infile 'c:\math707\rawdata\city_pop3.dat' DSD;
   input city $ population : comma10.;
Run;
proc print;
run;
The resulting data
Obs  city        population
1    NEW YORK    7262700
2    LOS ANGLES  3259340
3    CHICAGO     3009530
4    HOUSTON     1728910
Writing Character Strings and Variable Values to the External Data Set
Data citypop;
   set city;
   File 'c:\math707\rawdata\city_pop.dat' dsd;
   Put '2000 City Census ' city 'Total Population ' population : comma10.;
Run;
This program writes extra strings that describe City and Population in the created data set.
Using PROC EXPORT to Create an External Data Set
General syntax:
PROC EXPORT DATA=sas-data-set OUTFILE='filename' DBMS=DLM REPLACE;
   DELIMITER='delimiter';
   PUTNAMES=<YES|NO>;
RUN;
You can also export a data set from the SAS pull-down menu: choose File, then Export Data, and follow the step-by-step menu to create the external file.
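A concrete call might look like the sketch below. It exports the SASHELP.CLASS sample data set that ships with SAS; the output path is hypothetical and should be replaced with a folder that exists on your machine.

```sas
/* Sketch: export a sample data set to a slash-delimited text file. */
proc export data=sashelp.class
            outfile='c:\temp\class.txt'   /* hypothetical path */
            dbms=dlm
            replace;
   delimiter='/';   /* field delimiter written between values */
   putnames=yes;    /* write variable names as the first row */
run;
```

Setting PUTNAMES=NO instead would write the data rows only, without a header row.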
Exercise
Open the program c18_put_freeformat_Export. Run the programs and observe the results to make sure you learn how to write the PUT statement and PROC EXPORT.
Exercise
Open the program c18_Import to learn how to use PROC IMPORT to read free-format external data.
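For orientation before opening the exercise, a PROC IMPORT call for a delimited file generally has the shape sketched below. The file path and the output data set name class_in are hypothetical; the sketch assumes a slash-delimited text file whose first row holds the variable names.

```sas
/* Sketch: import a slash-delimited text file into a SAS data set. */
proc import datafile='c:\temp\class.txt'   /* hypothetical path */
            out=work.class_in              /* hypothetical data set name */
            dbms=dlm
            replace;
   delimiter='/';   /* field delimiter used in the file */
   getnames=yes;    /* take variable names from the first row */
run;

proc print data=work.class_in;
run;
```

Unlike a DATA step with INFILE and INPUT, PROC IMPORT decides variable types and lengths by scanning the file, so it is worth checking the result with PROC CONTENTS.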
Mixing Input Styles
We have introduced
• Column input,
• Formatted input,
• List input.
All of these input styles can be mixed in one INPUT statement, depending on the situation.
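As a sketch of mixing the three styles, the step below reads ID with column input, InService with formatted input, and the two capacities with list input. The data set name mixed and the in-stream records are hypothetical; the first two fields are deliberately placed in fixed columns.

```sas
/* Sketch: three input styles in one INPUT statement (hypothetical data). */
data mixed;
   input ID $ 1-5              /* column input: columns 1 through 5 */
         @7 InService date9.   /* formatted input: 9 columns starting at column 7 */
         PassCap CargoCap;     /* list input: blank-delimited values */
   datalines;
50001 04feb1989 132 530
50002 11nov1989 152 540
;
run;

proc print data=mixed;
   format InService date9.;
run;
```

Column and formatted input suit the fields that sit in known columns, while list input picks up the remaining values wherever they fall.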
Additional Materials Useful for Reading Delimited Data
The textbook introduces the following INFILE statement options for handling different situations when reading delimited external data:
MISSOVER, DSD, DLM='delimiter'
The following additional options are also useful for handling the end of a record:
STOPOVER, TRUNCOVER, FLOWOVER
Example
Consider the following raw data, to be read into the variable TESTNUM.
----+----1----+-
1
22
333
4444
55555
We will show the effect of using the FLOWOVER, MISSOVER, and TRUNCOVER options in the INFILE statement of this program:
data numbers;
   infile 'external-file';
   input testnum 5.;
run;
The Value of TESTNUM Using Different INFILE Statement Options
OBS  FLOWOVER  MISSOVER  TRUNCOVER
1    22        .         1
2    4444      .         22
3    55555     .         333
4              .         4444
5              55555     55555
Explanation of these options
• FLOWOVER is the default behavior. It causes the DATA step to look in the next record if the end of the current record is encountered before all of the variables are assigned values.
• MISSOVER causes the DATA step to assign missing values to any variables that do not have values when the end of a data record is encountered. The DATA step continues processing.
• STOPOVER causes the DATA step to stop execution immediately and write a note to the SAS log.
• TRUNCOVER causes the DATA step to assign values to variables, even if the values are shorter than expected by the INPUT statement, and to assign missing values to any variables that do not have values when the end of a record is encountered.
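One way to try these options is sketched below. It first writes the five records to a temporary file, because in-stream DATALINES records are padded with blanks and would hide the effect. The fileref nums and the data set names are hypothetical.

```sas
/* Sketch: compare FLOWOVER, MISSOVER, and TRUNCOVER on short records. */
filename nums temp;   /* temporary file; the fileref name is hypothetical */

data _null_;
   file nums;
   /* Each slash starts a new record, giving lengths 1 through 5. */
   put '1' / '22' / '333' / '4444' / '55555';
run;

data flow;
   infile nums flowover;    /* default behavior */
   input testnum 5.;
run;

data miss;
   infile nums missover;
   input testnum 5.;
run;

data trunc;
   infile nums truncover;
   input testnum 5.;
run;

proc print data=flow;  run;
proc print data=miss;  run;
proc print data=trunc; run;
```

Because the 5. informat asks for five columns, only the last record is long enough to satisfy it outright, which is what separates the three options.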