data transformation sas 1

Upload: sridhardivakar1084

Post on 15-Oct-2015


DESCRIPTION

Working with SAS - Basics

TRANSCRIPT

  • Data Transformation
    Data cleaning

  • Importing Data
    Reading data from external formats:
    Libname/Infile/Input for text form data
    Proc Import for Excel/Access data
    ODBC for external database data

  • Importing an Excel Spreadsheet
    PROC IMPORT OUT= WORK.Fall2007
        DATAFILE= "L:\DataWarehousing07f\CourseDatabase\Fall2007.xls"
        DBMS=EXCEL REPLACE;
      SHEET="'Fall 07$'";
      GETNAMES=YES;
      MIXED=NO;
      SCANTEXT=YES;
      USEDATE=YES;
      SCANTIME=YES;
    RUN;

  • Import an Access Table
    PROC IMPORT OUT= WORK.OrderLine
        DATATABLE= "OrderLin"
        DBMS=ACCESS REPLACE;
      DATABASE="I:\DataWarehousing07f\WholesaleProducts.mdb";
      SCANMEMO=YES;
      USEDATE=NO;
      SCANTIME=YES;
    RUN;

  • Good Practice
    Check the metadata for a dataset:
    PROC CONTENTS DATA= OrderLine; RUN;
    Print a few records:
    PROC PRINT DATA= OrderLine (OBS= 10); RUN;

  • Saving SAS Datasets
    LIBNAME course "L:\DataWarehousing07f\CourseDatabase";
    Data course.Spring2008;
      set spring2008;
    run;
    Note: the name associated with the LIBNAME statement (course) must be 8 characters or fewer.

  • LIBNAME / INFILE / INPUT for character data
    LIBNAME assigns a name (libref) to the location or folder where SAS data files are stored
    INFILE specifies the external file to read in a DATA step
    INPUT reads text format data
    SET reads SAS data

  • INFILE with INPUT for character data files
    DATA Fitness;
      INFILE "L:\DataWarehousing07f\TransformationSAS\SAS1.txt";
      INPUT NAME $ WEIGHT WAIST PULSE CHINS SITUPS JUMPS;
    run;

  • Creating Derived Attributes
    Generating new attributes for a table. SAS creates attributes when they are referred to in a data step. The metadata depends on the context of the code.
    LENGTH statements
    FORMAT statements
    FORMATS and INFORMATS
    PUT
    INPUT
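
    As a minimal sketch of how context sets the metadata: a character variable takes its length from the first statement that refers to it, unless a LENGTH statement comes first. The variable names ShortVar and LongVar are made up for this illustration.

```sas
/* Sketch: the first reference to a variable fixes its type and length. */
data _null_;
  ShortVar = 'AB';        /* first reference: character, length 2      */
  ShortVar = 'ABCDEF';    /* later value is truncated to 2 characters  */

  length LongVar $ 6;     /* declare the length before the first use   */
  LongVar = 'AB';
  LongVar = 'ABCDEF';     /* fits, because the length is already 6     */

  put ShortVar= LongVar=; /* writes both values to the SAS log         */
run;
```

    Here ShortVar ends up holding AB while LongVar holds ABCDEF, which is why the later slides place LENGTH before SET or before the first assignment.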

  • PUT and INPUT Functions
    TextOutput = PUT(variable, format)
    Note: the result of a PUT function is always character
    Note: there is also a PUT statement that writes the contents of a variable to the SAS log

    Output = INPUT(CharacterInput, informat)
    Note: the source argument of an INPUT function is always character
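
    The two functions can be tried side by side in a _NULL_ step. This is only a sketch; the variable names and sample values are invented for the illustration.

```sas
/* Sketch: PUT turns a value into text, INPUT reads text into a value. */
data _null_;
  ZipNum    = 2138;
  ZipText   = put(ZipNum, z5.);              /* character '02138'      */

  PriceText = '1,234.50';
  PriceNum  = input(PriceText, comma10.);    /* numeric 1234.5         */

  DateText  = '10/15/2015';
  DateNum   = input(DateText, mmddyy10.);    /* numeric SAS date value */

  put ZipText= PriceNum= DateNum= date9.;    /* PUT statement: writes to the log */
run;
```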

  • Formats
    Formats always contain a period
    Formats for character variables always start with a $
    The most used format categories are Character, Date and Time, and Numeric

    Note: use the SAS search tab to look for Formats. For a list of SAS formats look under: Formats: Formats by Category
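
    A short sketch with one format from each of the three categories. The values are invented; the formats used ($UPCASEw., DOLLARw.d, MMDDYYw., DATEw.) are standard SAS formats.

```sas
/* Sketch: one character, one numeric and two date formats. */
data _null_;
  Name   = 'Smith';
  Amount = 1234.5;
  Day    = '15OCT2015'd;          /* date literal                     */

  put Name=   $upcase10.;         /* character format, starts with $  */
  put Amount= dollar10.2;         /* numeric format                   */
  put Day=    mmddyy10.;          /* date format                      */
  put Day=    date9.;             /* same value, different date format */
run;
```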

  • Good Practice
    The following code is handy for testing functions and formats in SAS. The _Null_ dataset name tells SAS not to create the dataset in the WORK library.

    Data _Null_;
      InputVal= 123;
      OutputVal= PUT(InputVal, Roman30.);
      PUT InputVal OutputVal;
    run;

  • Generating Dates
    Generating a Date dimension
    Usually done offline in something like Excel
    SAS has extensive date and datetime functions and formats
    SAS formats apply to only one of datetime, date or time variable types. Convert from one type to another with SAS functions.
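
    A Date dimension can also be generated inside SAS with a DO loop over date values. The sketch below is one possible layout, not the course's own; the dataset name DateDim, the column names, and the 2007 date range are assumptions.

```sas
/* Sketch: one row per calendar day, with a few derived attributes. */
data DateDim;
  format Date date9.;
  do Date = '01JAN2007'd to '31DEC2007'd;
    Year      = year(Date);
    Quarter   = qtr(Date);
    Month     = month(Date);
    MonthName = put(Date, monname9.);   /* character month name         */
    DayOfWeek = weekday(Date);          /* 1 = Sunday ... 7 = Saturday  */
    output;                             /* write one row per date       */
  end;
run;

proc print data=DateDim (obs=10);
run;
```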

  • Creating a text variable for Date
    Data Orders2;
      Length Date $10.;
      Set Orders;
      Date= PUT( Datepart(OrderDate), MMDDYY8.);
    run;
    The Length statement ensures that the variable will have enough space. It must come before the SET.
    OrderDate is a datetime value. The DATEPART function returns the date portion as a SAS date value. MMDDYYw. is a date format.

  • SAS Functions
    We are especially interested in Character and Date and Time functions

    Note: use the SAS search tab to look for Functions. For a list of SAS functions look under: Functions and CALL routines: Functions and CALL Routines by Category

  • Useful Data Cleaning Functions
    Text Manipulation: COMPRESS, STRIP, TRIM, LEFT, RIGHT, UPCASE, LOWCASE
    Text Extraction: INDEX, SCAN, SUBSTR, TRANSLATE, TRANWRD
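
    A few of these functions combined, as a sketch on invented values. COMPBL (which collapses repeated blanks) is not on the slide's list and is added here only for illustration.

```sas
/* Sketch: typical clean-up of a phone number and a name. */
data _null_;
  Raw    = '  (555) 123-4567  ';
  Digits = compress(Raw, , 'kd');          /* 'kd' modifier: keep digits only */

  Name   = '  acme   supply ';
  Clean  = upcase(compbl(strip(Name)));    /* ACME SUPPLY                     */
  First  = scan(Clean, 1, ' ');            /* first blank-delimited word      */

  put Digits= Clean= First=;
run;
```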

  • Parsing
    The process of splitting a text field into multiple fields
    Uses SAS functions to extract parts of a character string.
    Fixed position in a string: SUBSTR
    Known delimiter: SCAN
    Note: it is a good idea to strip blanks before you try to parse a string.

  • Example of Parsing
    Data Customer2;
      LENGTH street cust_addr $20.;
      FORMAT street cust_addr $20.;
      SET Customer;
      Cust_Addr= TRIM(Cust_Addr);
      Number= Scan(Cust_Addr,1,' ');
      Street= Scan(Cust_Addr,2,' ');
    run;

    Note: The LENGTH and FORMAT statements set street and cust_addr to 20 characters, so the printed output is not padded with extra trailing blanks.

  • Parsing Results

    Obs  cust_addr    Number  street
      1  481 OAK      481     OAK
      2  215 PETE     215     PETE
      3  48 COLLEGE   48      COLLEGE
      4  914 CHERRY   914     CHERRY
      5  519 WATSON   519     WATSON
      6  16 ELM       16      ELM
      7  108 PINE     108     PINE

  • Good Practice
    Always print the before and after images here. Parsing free-form text can be quite a problem. For example, apartment addresses 110b Elm and 110 b Elm will parse differently. In this case you may have to search the second word for things that look like apartments and correct the data.
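
    One way to approach the apartment problem is sketched below. It is a heuristic only, under the assumption that a unit is a letter fused to the house number or a single-letter second word; the variable Apt and the sample addresses are invented for the illustration.

```sas
/* Sketch: pull an apartment letter out of the number or the second word. */
data _null_;
  length Addr Number Street $ 20 Apt $ 5;
  do Addr = '110b Elm', '110 b Elm', '110 Elm';
    Number = scan(Addr, 1, ' ');
    Street = scan(Addr, 2, ' ');
    Apt    = ' ';

    /* Case 1: a letter is fused to the house number, e.g. 110b */
    if anyalpha(Number) > 1 then do;
      Apt    = substr(Number, anyalpha(Number));
      Number = substr(Number, 1, anyalpha(Number) - 1);
    end;
    /* Case 2: the second word is a single letter, e.g. 110 b Elm */
    else if length(Street) = 1 and anyalpha(Street) = 1 then do;
      Apt    = Street;
      Street = scan(Addr, 3, ' ');
    end;

    put Addr= Number= Apt= Street=;   /* before and after images in the log */
  end;
run;
```

    Real address data will contain cases this does not catch, so the results still need the before and after review described above.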

  • SUBSTR(string, position <, length>)
    Use this when you have a known position for characters.
    String: character expression
    Position: start position (starts with 1)
    Length: number of characters to take (if omitted, takes all to the end)
    VAR= 'ABCDEFG'
    NEWVAR= SUBSTR(VAR,2,2)     NEWVAR: BC
    NEWVAR2= SUBSTR(VAR,4)      NEWVAR2: DEFG

  • SUBSTR(variable, position <, length>) = new-characters
    Replaces character value contents. Use this when you know where the replacement starts.
    a='KIDNAP'; substr(a,1,3)='CAT';     a: CATNAP
    a='KIDNAP'; substr(a,4)='TY';        a: KIDTY
    Note: each example starts from the original value 'KIDNAP'.

  • INDEX(source, excerpt)
    Searches a character expression for a string of characters. Returns the position (number) where the string begins, or 0 if the string is not found.
    a='ABC.DEF (X=Y)'; b='X=Y';
    x= index(a,b);        x: 10
    x= index(a,'DEF');    x: 5

  • Alternative INDEX functions
    INDEXC searches for the first occurrence of any single character from a list of characters
    INDEXW searches for a word
    Syntax: INDEXW(source, excerpt)
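
    The difference between the three functions shows up on a single test string. This is a sketch with an invented address value.

```sas
/* Sketch: INDEX vs INDEXC vs INDEXW on the same string. */
data _null_;
  Addr = '12 Elm St, Apt 4';

  AsString   = index(Addr, 'St');     /* 8: substring match                  */
  FirstPunct = indexc(Addr, ',.;');   /* 10: first of , . ; found            */
  AsWord     = indexw(Addr, 'Apt');   /* 12: whole word bounded by blanks    */
  NotAWord   = indexw(Addr, 'Ap');    /* 0: 'Ap' is only part of a word      */

  put AsString= FirstPunct= AsWord= NotAWord=;
run;
```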

  • Length
    Returns the length of a character variable
    The LENGTH and LENGTHN functions return the same value for non-blank character strings. LENGTH returns a value of 1 for blank character strings, whereas LENGTHN returns a value of 0.
    The LENGTH function returns the length of a character string, excluding trailing blanks, whereas the LENGTHC function returns the length of a character string, including trailing blanks. LENGTH always returns a value that is less than or equal to the value returned by LENGTHC.
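
    The three functions can be compared on a padded value and on a blank value. A small sketch with invented variable names:

```sas
/* Sketch: LENGTH, LENGTHN and LENGTHC on a 10-character variable. */
data _null_;
  length Name BlankVal $ 10;
  Name     = 'OAK';
  BlankVal = ' ';

  L_Name  = length(Name);        /* 3: trailing blanks excluded  */
  LC_Name = lengthc(Name);       /* 10: trailing blanks included */
  L_Blank = length(BlankVal);    /* 1 for a blank string         */
  LN_Blank = lengthn(BlankVal);  /* 0 for a blank string         */

  put L_Name= LC_Name= L_Blank= LN_Blank=;
run;
```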

  • Standardizing
    Adjusting terms to a standard format.
    Based on frequency listings.
    Use functions or IF statements
    TRANWRD is easy but can produce unexpected results
    IF statements are safer, but less general

  • Standardization Code
    Supplier= Tranwrd(supplier, " Incorporated", "");

    If Supplier= "Trinkets & Things" then supplier= "Trinkets n' Things";

    More complex logic is often needed. See the course examples.
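
    To illustrate how TRANWRD can surprise, the sketch below removes "Inc" from two invented supplier names. TRANWRD replaces every occurrence of the target string, including one inside a longer word; the alternative shown, which only drops a trailing word, is just one possible safer rule.

```sas
/* Sketch: blunt TRANWRD versus a rule limited to the last word. */
data _null_;
  length Supplier Blunt Safer $ 40;
  do Supplier = 'Acme Inc', 'Incline Tools Inc';

    /* Replaces every 'Inc', so 'Incline' is damaged as well */
    Blunt = tranwrd(Supplier, 'Inc', '');

    /* Only drop 'Inc' when it is the last word of the name */
    Safer = Supplier;
    if scan(Supplier, -1, ' ') = 'Inc' and length(Supplier) > 3 then
      Safer = substr(Supplier, 1, length(Supplier) - 3);

    put Supplier= Blunt= Safer=;
  end;
run;
```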

  • Good Practice
    It is a good idea to produce a change log for standardized changes:
    Data Products2 Changed;
      Set Products;
      SupplierOld= Supplier;
      /* * * * * standardization statements go here * * * * */
      Output Products2;
      If Trim(supplier) ^= Trim(SupplierOld) then output Changed;
    run;
    Proc Print Data= Changed;
      Var SupplierOld Supplier;
    run;

  • Locating Anomalies
    Frequency counts are a good way to identify anomalies.
    It is also helpful to identify standard changes that you do not have to review.
    Probably the safest way to execute standard changes is with a Change Table that lists From and To values. (An advanced SAS exercise: go for it!)
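
    A minimal sketch of the Change Table idea, assuming a Products dataset with a character Supplier column as in the earlier slides. The names ChangeTable, FromValue, ToValue and SupplierNew are invented for the illustration, and PROC SQL is just one of several ways to apply the table.

```sas
/* Sketch: a From/To table applied with a left join. */
data ChangeTable;
  length FromValue ToValue $ 40;
  FromValue = 'Trinkets & Things';
  ToValue   = "Trinkets n' Things";
  output;
run;

proc sql;
  create table Products2 as
  select p.*,
         coalesce(c.ToValue, p.Supplier) as SupplierNew  /* keep old value if no match */
  from Products as p
  left join ChangeTable as c
    on p.Supplier = c.FromValue;
quit;

/* Change log: only the rows the table actually changed */
proc print data=Products2;
  where Supplier ne SupplierNew;
  var Supplier SupplierNew;
run;
```

    Because every change comes from a reviewed From/To row, nothing is altered by accident, which is the safety advantage over TRANWRD.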

  • De-duplicating
    Reconcile different representations of the same entity
    Done after standardizing. Usually requires multi-field testing.
    May use probabilistic logic, depending on the application.
    Should produce a change log.
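
    For exact matches on several fields, PROC SORT can remove duplicates and write the removed rows to a second dataset that serves as the change log. This is a sketch only: Cust_Name is a hypothetical column, and probabilistic matching needs more than this.

```sas
/* Sketch: multi-field de-duplication with a log of removed rows. */
proc sort data=Customer2
          out=CustomerUnique
          dupout=CustomerDups     /* rows dropped as duplicates */
          nodupkey;
  by Cust_Name Cust_Addr;         /* fields that define "the same entity" */
run;

proc print data=CustomerDups;
run;
```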

  • Correcting
    Identifying and correcting values that are wrong
    Very difficult to do. Usually based on exception reports or range checks.
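
    A range check that feeds an exception report might look like the sketch below. It reuses the Fitness dataset from the INFILE example; the limits and the Reason column are assumptions chosen only for illustration.

```sas
/* Sketch: split rows into clean data and an exception report. */
data FitnessClean Exceptions;
  set Fitness;
  length Reason $ 30;

  if missing(Weight) then Reason = 'Weight missing';
  else if Weight < 50 or Weight > 400 then Reason = 'Weight out of range';
  else if missing(Pulse) then Reason = 'Pulse missing';
  else if Pulse < 30 or Pulse > 220 then Reason = 'Pulse out of range';

  if Reason ne ' ' then output Exceptions;   /* needs manual review */
  else output FitnessClean;
run;

proc print data=Exceptions;
run;
```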