alan elliott
DESCRIPTION
Alan Elliott. Dealing with Missing Values. Title "Dealing with Missing Values in SQL"; PROC SQL ; select INC_KEY,GENDER, RACE, INJTYPE, case when ISS=- 81 then . else ISS end as ISS, case when EDGCSTOTAL=- 81 then . else EDGCSTOTAL end as EDGCSTOTAL, - PowerPoint PPT PresentationTRANSCRIPT
SAS SQL Part 2Alan Elliott
Dealing with Missing ValuesTitle "Dealing with Missing Values in SQL";PROC SQL;select
INC_KEY,GENDER, RACE, INJTYPE,case when ISS=-81 then . else ISSend as ISS,
case when EDGCSTOTAL=-81 then . else EDGCSTOTALend as EDGCSTOTAL,case when AGE=-81 then . else AGEend as AGE
from "C:\sasdata\trauma_sm";quit;
SEE SQLMISSING.SAS
Partial Output
INC_KEY GENDER RACE INJTYPE ISS EDGCSTOTAL AGE468879 Female Black Blunt 1 . 9.1468942 Female Black Blunt 20 . 17.7468961 Female White, not of
Hispanic OriginBlunt 5 . 16.6
468971 Female Black Blunt 5 15 6.5469030 Male Black . . 6.7469055 Male Black 1 15 10.8487580 Male Hispanic Penetrating 25 6 17.7597075 Male White, not of
Hispanic OriginBlunt 9 15 14.6
603091 Female White, not of Hispanic Origin
Blunt 17 3 5.1
Missing Values
Character variables
• Take INJTYPE out of initial list of variables on the SELECT statement and add the following code:
end as AGE,case when INJTYPE="Burn" then "" else INJTYPEend as INJTYPE
• And rerun
Results
603227 Male White, not of Hispanic Origin
26 15 16 Blunt
603228 Male White, not of Hispanic Origin
1 15 7.6 Blunt
603237 Female White, not of Hispanic Origin
17 3 16.8 Blunt
603238 Male Black 2 15 15.7 Blunt603243 Male White, not of
Hispanic Origin14 15 7.8 Blunt
603246 Male White, not of Hispanic Origin
1 15 1.4
603259 Male White, not of Hispanic Origin
1 15 14 Penetrating
603260 Female White, not of Hispanic Origin
9 15 13.3 Blunt
603262 Female White, not of Hispanic Origin
26 14 7.1 Blunt
WAS BURN
Combined CASE / ConditionalPROC SQL;select
INC_KEY,GENDER,RACE, AGE,DISSTATUS,case when AGE LT 10 and DISSTATUS in ("Dead") then "CHILD DEATH" when AGE GE 10 and DISSTATUS in ("Dead") then "OTHER DEATH" ELSE "Alive"end as TYPEDEATH
from "C:\sasdata\trauma_sm";quit;
• SEE SQLCASE.SAS
Partial Output
603157 Female Black 3.5 Alive Alive603158 Male White, not of Hispanic
Origin7.7 Alive Alive
603169 Female White, not of Hispanic Origin
5.1 Alive Alive
603170 Male White, not of Hispanic Origin
14.7 Alive Alive
603173 Male White, not of Hispanic Origin
5.7 Dead CHILD DEATH
603188 Male White, not of Hispanic Origin
17.8 Alive Alive
603192 Male White, not of Hispanic Origin
12.8 Dead OTHER DEATH
603196 Male White, not of Hispanic Origin
15.6 Alive Alive
603200 Female White, not of Hispanic Origin
13.7 Alive Alive
Order by TYPEDEATH (Descending)
Modify the code to read …
…end as TYPEDEATH
from "C:\sasdata\trauma_sm"ORDER BY TYPEDEATH DESC;
Results
INC_KEY GENDER RACE AGE DISSTATUS TYPEDEATH603192 Male White, not of Hispanic
Origin12.8 Dead OTHER DEATH
603122 Male Hispanic 14.1 Dead OTHER DEATH603294 Female White, not of Hispanic
Origin10.6 Dead OTHER DEATH
603173 Male White, not of Hispanic Origin
5.7 Dead CHILD DEATH
603371 Male White, not of Hispanic Origin
7.0 Dead CHILD DEATH
603280 Male White, not of Hispanic Origin
18.0 Alive Alive
603262 Female White, not of Hispanic Origin
7.1 Alive Alive
603463 Male Asian or Pacific Islander
12.4 Alive Alive
Summarize and CountPROC SQL;select
case when AGE LT 10 and DISSTATUS in ("Dead") then "CHILD DEATH" when AGE GE 10 and DISSTATUS in ("Dead") then "OTHER DEATH" ELSE "Alive"end as TYPEDEATH,count(calculated TYPEDEATH) as COUNTDEATH
from "C:\sasdata\trauma_sm"GROUP BY TYPEDEATH;quit;
SEE SQLSUMMARY1.SAS
(NOTE: Take calculates out of above statement, and observe error.)
Results
TYPEDEATH COUNTDEATHAlive 95CHILD DEATH 2OTHER DEATH 3
Reorder the table
Change the code (at end to read)
from "C:\sasdata\trauma_sm"GROUP BY TYPEDEATHorder by typedeath;
Results
TYPEDEATH COUNTDEATH
OTHER DEATH 3
CHILD DEATH 2
Alive 95
SQL Summary FunctionsSummary Function DescriptionAVG, MEAN Average or mean of valuesCOUNT, FREQ, N Aggregate number of non-missing valuesCSS Corrected sum of squaresCV Coefficient of variationMAX Largest valueMIN Smallest valueNMISS Number of missing valuesPRT Probability of a greater absolute value of Student’s tRANGE Difference between the largest and smallest valuesSTD Standard deviationSTDERR Standard error of the meanSUM Sum of valuesSUMWGT Sum of the weight variable values which is 1T Testing the hypothesis that the population mean is zeroUSS Uncorrected sum of squaresVAR Variance
Using Some Summary Functionsproc sql;select count(brand) as Tot_Cars, sum(minivan) as TOT_Minivans, min(CITYMPG) as MIN_MPG, max(CITYMPG) as MAX_MPG, SUM(CITYMPG)/COUNT(CITYMPG) as AVG_MPGfrom sasdata.cars;quit;
SEE SQLSUMMARY2.SASTot_Cars TOT_Minivans MIN_MPG MAX_MPG AVG_MPG1081 30 10 60 19.28955
Add group statement
from sasdata.cars group by minivan;quit; Tot_Cars TOT_Minivans MIN_MPG MAX_MPG AVG_MPG1051 0 10 60 19.3244530 30 16 20 18.06667
Compare Sort (Data Step)DATA SORTDATA MYDATA;INPUT @1 LAST $20. @21 FIRST $20. @45 PHONE $12.;Label LAST = 'Last Name' FIRST = 'First Name' PHONE = 'Phone Number';DATALINES;Reingold Lucius 201-876-0987Jones Pam 987-998-2948Etc…;*-------- Modify to sort by first name within last (by last first);PROC SORT; BY LAST FIRST; PROC PRINT LABEL NOOBS;TITLE 'ABC Company';TITLE2 'Telephone Directory';RUN;
RUN this code, observe results.
SEE DATASORT.SAS
Results from DATA statement
ABC CompanyTelephone Directory
Last Name First Name Phone NumberAdams Abby 214-876-0987Baker Crusty 222-324-3212Jones Jackie 456-987-8077Jones Pam 987-998-2948Reingold Lucius 201-876-0987Smith Arnold 234-321-2345Smith Bev 213-765-0987Smith John 234-943-0987Zoll Tim Bob 303-987-2309
Sort Using SQL
PROC SQL;SELECT LAST LABEL="Last Name", FIRST LABEL="First Name", PHONE LABEL "Phone Number"
from MYDATAORDER by LAST, FIRST;
QUIT;SEE SQLSORT.SASNote – variables appear in table in order selected…
Results for SQL SortABC Company
Telephone Directory
Note – easier to order variable names.
First Name Last Name Phone NumberAbby Adams 214-876-0987Crusty Baker 222-324-3212Jackie Jones 456-987-8077Pam Jones 987-998-2948Lucius Reingold 201-876-0987Arnold Smith 234-321-2345Bev Smith 213-765-0987John Smith 234-943-0987Tim Bob Zoll 303-987-2309
Appending/Concatenating Two Files
• Recall from the Data Step, to append two data files you can use the code
DATA NEW; SET OLD1 OLD2; RUN;
(See SQLAPPEND.SAS)
Results for Data Append
Obs SUBJ AGE YRS_SMOKE MARRIED1 001 34 12 .2 003 44 14 .3 004 55 35 .4 006 21 3 .5 011 33 11 16 012 25 19 07 023 65 45 18 032 71 55 1
Append Files using SQL
PROC SQL; select * from old1 union select * from old2;QUIT;
UNION-means concatenate the query results. It produces all the unique rows that result from both queries
Results (same as Data Append)
SUBJ AGE YRS_SMOKE MARRIED001 34 12 .003 44 14 .004 55 35 .006 21 3 .011 33 11 1012 25 19 0023 65 45 1032 71 55 1
Basic SQL Operators
UNION produces all unique rows from both queries.
EXCEPT produces rows that are part of the first query only.
INTERSECT produces rows that are common to both query results.
OUTER UNION concatenates the query results.
Combine two or more queries in various ways by using the following set operators:
Duplicate records
• Suppose there are duplicate records.• See SQLAPPEND2.SAS
DATA OLD1;INPUT SUBJ $ AGE YRS_SMOKE;datalines;001 34 12003 44 14004 55 35006 21 3011 33 11 ;
DATA OLD2;INPUT SUBJ $ AGE YRS_SMOKE MARRIED;datalines;006 21 3 .011 33 11 1012 25 19 0023 65 45 1032 71 55 1;RUN;
This record added
This record added
Union appends, keeps unique rowsPROC SQL; select * from old1 union select * from old2;QUIT;One row 6 is unique. Two row 11’s are unique. (UNION keeps all unique.)
SUBJ AGE YRS_SMOKE MARRIED001 34 12 .003 44 14 .004 55 35 .006 21 3 .011 33 11 .011 33 11 1012 25 19 0023 65 45 1032 71 55 1
This is the same code from before – only difference is the duplicated records in the data sets.
Union all
• To keep all rows, use UNION ALL
PROC SQL; select * from old1 union all select * from old2;QUIT;
Add ALL to the code and re-run.
Results – Union All
SUBJ AGE YRS_SMOKE MARRIED001 34 12 .003 44 14 .004 55 35 .006 21 3 .011 33 11 .006 21 3 .011 33 11 1012 25 19 0023 65 45 1032 71 55 1
Both SUBJ 6 records included (even though not unique).
EXCEPT
• To keep only the data from the first data set that are not in the 2nd set (but all variables) use EXCEPTPROC SQL; select * from old1 except select * from old2;QUIT;
Run this code and observe output.
Except Output
SUBJ AGE YRS_SMOKE MARRIED001 34 12 .003 44 14 .004 55 35 .011 33 11
Note – Record 6 is the same, so it is not kept. Record 11 is different so it is kept.
Switch Data Set order
PROC SQL; select * from old2 except select * from old1;QUIT;
Run this code and observe output.
Output
SUBJ AGE YRS_SMOKE MARRIED011 33 11 1012 25 19 0023 65 45 1032 71 55 1
Note – (Same as before) Record 6 is the same, so it is not kept. Record 11 is different so it is kept.
Except ALLdatalines;001 34 12003 44 14004 55 35006 21 3006 21 3011 33 11 ;
Suppose there was a duplicate record 006 in the first data set. Using EXCEPT, record 6 would not appear in the result because there is a record 6 in the second data set. If you want a duplicate record that is not a duplicate matched in the 2nd data set to appear in the result, use EXCEPT ALL
See SQLAPPEND2a.SAS
Added duplicate record
Except ALL
PROC SQL; select * from old1 except all select * from old2;QUIT;
SUBJ AGE YRS_SMOKE MARRIED001 34 12 .003 44 14 .004 55 35 .006 21 3 .011 33 11 .
6 is a duplicate record in OLD1 and there is not a duplicate 6 in
OLD 2, so it appears in the results
Intersect
• The INTERSECT command returns only those records that occur in both data sets.
• Change EXCEPT ALL to INTERSECTPROC SQL; select * from old1 intersect select * from old2;QUIT;
Intersect Results
SUBJ AGE YRS_SMOKE MARRIED006 21 3 .
Only record 6 was duplicated in both data sets.
Compare Union with Outer Union
• UNION- produces all unique rows from both queries.
• OUTER UNION – concatenates the series results
• See SQLAPPEND3.SAS
Compare OutputPROC SQL; select * from old1 UNION select * from old2;QUIT;
PROC SQL; select * from old1 OUTER UNION select * from old2;QUIT;
Results of Union
Results of UNION
SUBJ AGE YRS_SMOKE MARRIED001 34 12 .003 44 14 .004 55 35 .006 21 3 .011 33 11 .011 33 11 1012 25 19 0023 65 45 1032 71 55 1
Results of Outer Union
SUBJ AGE YRS_SMOKE SUBJ AGE YRS_SMOKE MARRIED001 34 12 . . .003 44 14 . . .004 55 35 . . .006 21 3 . . .006 21 3 . . .011 33 11 . . .
. . 006 21 3 .
. . 011 33 11 1
. . 012 25 19 0
. . 023 65 45 1
. . 032 71 55 1
Note: SQL allows you to create a data set with DUPLICATE variable names.
Cartesian Join
• Combines ALL rows from one file with ALL rows from another.
PROC SQL; select * from old1, old2;QUIT;SEE SQLCARTESIAN_JOIN.SAS
Cartesian Join
SUBJ AGE YRS_SMOKE SUBJ AGE YRS_SMOKE MARRIED001 34 12 006 21 3 .001 34 12 011 33 11 1001 34 12 012 25 19 0001 34 12 023 65 45 1001 34 12 032 71 55 1003 44 14 006 21 3 .003 44 14 011 33 11 1003 44 14 012 25 19
Note SUBJ 1 appears 5 times
Using Table Aliases
select a.subj, a.age, b.subj as sub_from_b, etc…
from old1 a, old2 b
Table Alias allows you to distinguish variables from different tables without ambiguity.
Note “a” variables prefix specifies that the variable
is from the table “old1” since “old1” is labeled as
the “a” table below.
Table old1 is labeled as table “a” in this code.
Inner Join (Using Table Alias)PROC SQL; select a.subj, a.age,
b.subj, b.age, b.married as Married from old1 a, old2 b
where a.subj=b.subj;QUIT;
In an INNER JOIN, only observations with both key values matching are selected.
SEE SQL_INNER_JOIN.SAS
Inner Join Code
PROC SQL; select a.subj, a.age,
b.subj, b.age, b.married as Married from old1 a, old2 b
where a.subj=b.subj;QUIT;
“Where” limits the join to those that are in BOTH Tables
Note Married is from table “b” but will be called Married in output
Inner Join Results
Results of INNER (Conventional) Join
SUBJ AGE SUBJ AGE Married006 21 006 21 .006 21 006 21 .011 33 011 33 1
Inner Join (1-to-1 merge) Example 2
PROC SQL; select * from old1, old2
where old1.subj=old2.subect;QUIT;
Use the table names as the alias.SEE SQL_INNER_JOIN2.SAS
Results
SUBJ AGE YRS_SMOKE SUBJECT SBP MARRIED003 44 14 003 110 1001 34 12 001 120 .004 55 35 004 90 0006 21 3 006 100
Note – data for both files where not in order. Only those in both tables with a matching key variable are included in the result.
EXERCISE – Add the phrase “order by old1.subj” to put the table in Subject order.
One to Many MergeSuppose you have data like this… you want to match building to employees.
Data LOC;input BUILDING $ Location $;datalines;A1 DALLASA2 WACOA3 HOUSTON;RUN;DATA EMPLOYEE;input EID $ LOC $ ROOMNUMBER;datalines;001 A2 103003 A1 100005 A1 1001006 A3 12002 A1 101.1;run;
SQL Code 1-to-Many
PROC SQL; select * from LOC, EMPLOYEE where LOC.BUILDING=EMPLOYEE.LOC order by EMPLOYEE.EID;quit;
Use the table names as the alias.SEE SQL_INNER_JOIN3.SAS
Results of 1-to-Many
BUILDING Location EID LOC ROOMNUMBER
A2 WACO 001 A2 103A1 DALLAS 002 A1 101.1A1 DALLAS 003 A1 100A1 DALLAS 005 A1 1001A3 HOUSTON 006 A3 12
Each employee is matched to a building.
End – Do Exercises