Getting data: The first step in data analysis

Upload: suchi

Post on 23-Feb-2016



Slide 1: Learning Objective

Using SAS/BASE to connect to third-party relational database software to extract data needed for:
- program evaluation
- research using administrative data
- operational reports, e.g. routine surveillance

What is a relational database?
Contact your DBA to find out how to connect to your database(s).
How to write queries using PROC SQL.

SHRUG, 2014-05-02

Slide 2: What is a relational database?
- A set of tables
- Tables are made up of rows and columns
- Trade names of relational databases (RDB): Oracle, Teradata, SQL Server, DB2, Access
- An RDB is software designed to retain large amounts of data:
  - transactional DB
  - reporting/warehousing DB

Slide 3: What is a relational database?
- Transactional DB: designed to increase speed for front-end users
  - complex table and table-join structures
- Warehousing DB: designed for efficient storage and retrieval for reporting
  - simpler table designs and table-join structures
- Queries for either design use the same syntax (code)
  - queries for warehouses will be simpler to write

Slide 4: What is a relational database?
Why use relational databases?
- Relational databases use a concept called normalization
- Normalization reduces the amount of redundant data and allows data to be updated with fewer errors
- There are degrees of normalization:
  - first degree
  - second degree
  - third degree and higher degrees

Slide 5: First degree normalization
- Each row pertains to a single entity: a patient, an encounter, a physician
- Each column pertains to a characteristic of the entity, e.g. date of birth, sex, date of encounter, etc.

ID    FirstName  Gender  BirthCity  BirthCountry
0001  John       M       Moncton    Canada
0002  Devbani    F       Kolkata    India

Table 1: Subjects with demographic information

Slide 6: Violation of first degree normalization

SubjID  FirstName  Gender  BirthCity    BirthCountry
0001    John       43      Moncton      New Brunswick
0002    Raha       F       West Bengal  India

Table 1: Subjects with improper 1NF

What impact does violating first degree normalization have on your query
- if you want all patients born in Canada?
- if you want all male patients?

Slide 7: Second degree normalization
Table 2 has employer information about rows in Table 1.

Name     City     Prov  PostalCode
John     Halifax  NS    B3K 6R8
Devbani  Halifax  NS    B3H 2Y9

Table 2: Business addresses

The table above has some redundant information:
- name is repeated from Table 1
- province is embedded in the postal code
Better design: two or even three tables.

Slide 8: Second degree normalization

SubjID  PostalCode
0001    B3K 6R8
0002    B3H 2Y9

Table 2: Revised with 2NF

PostalCode  City     Prov
B3K 6R8     Halifax  NS
B3H 2Y9     Halifax  NS

Table 3: Creating a secondary table for 2NF

Slide 9: Second degree normalization
- Table 2 no longer contains the name; it is replaced with the subject ID
- To get the subject's name, we link Table 2 to Table 1 (the table in the first example) using the SUBJID/ID column
- We get the province and city by linking Tables 2 and 3 using the POSTALCODE column
- SUBJID is a primary key in Tables 1 and 2
- POSTALCODE is a foreign key in Table 2, but a primary key in Table 3

Slide 10: Primary/Foreign Keys
- primary key: a column or combination of columns that uniquely identifies each row in the table
  - e.g. a patient medical record needs at least 3 columns to identify a unique record: patient ID, date of encounter, and provider ID
- foreign key: a column or combination of columns that is used to link data between two tables
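The linking described on slides 9 and 10 can be sketched in PROC SQL. This is only an illustration: the dataset names work.subjects, work.business and work.postal are hypothetical stand-ins for Tables 1, 2 (revised) and 3, and the rows are the ones shown on the slides.

```sas
/* Table 1: subjects. ID is the primary key. (Dataset name is hypothetical.) */
data work.subjects;
  infile datalines dlm=',';
  length ID $4 FirstName $10 Gender $1 BirthCity $10 BirthCountry $10;
  input ID $ FirstName $ Gender $ BirthCity $ BirthCountry $;
  datalines;
0001,John,M,Moncton,Canada
0002,Devbani,F,Kolkata,India
;
run;

/* Table 2 (revised with 2NF): SubjID links to Table 1, PostalCode links to Table 3. */
data work.business;
  infile datalines dlm=',';
  length SubjID $4 PostalCode $7;
  input SubjID $ PostalCode $;
  datalines;
0001,B3K 6R8
0002,B3H 2Y9
;
run;

/* Table 3: postal codes. PostalCode is the primary key. */
data work.postal;
  infile datalines dlm=',';
  length PostalCode $7 City $10 Prov $2;
  input PostalCode $ City $ Prov $;
  datalines;
B3K 6R8,Halifax,NS
B3H 2Y9,Halifax,NS
;
run;

/* Link the three tables on their key columns to rebuild the business address list. */
proc sql;
  create table work.business_address as
  select s.ID, s.FirstName, p.City, p.Prov, b.PostalCode
  from work.subjects as s
       inner join work.business as b on s.ID = b.SubjID
       inner join work.postal as p on b.PostalCode = p.PostalCode
  order by s.ID;
quit;
```

The result has one row per subject, with the name taken from Table 1 and the city and province taken from Table 3, so neither is stored twice.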

Slide 11: Questions about 2NF?
- Can you see the advantage of splitting the data into different tables?
- Share examples of your data where normalization is used
- Higher degrees of normalization work similarly to the examples above
  - you have to go through more tables at higher levels of normalization in order to link to the data that you need

Slide 12: Getting access to data: What do you need from DBA?
- Explain to the DBA that you need to query data but have no need to write to the database
  - this helps them determine where you belong on a user matrix
- DBA or IT installs the necessary software on your machine
- Google has lots of information on SAS Connect
- SAS Connect documentation

Slide 13: How SAS authenticates
- User name is provided by DBA/IT
- In this example the password is held in the macro variable DBPASS
- The %put statement has Oracle print any messages to the SAS log

    proc sql;
      connect to oracle (user=<user name> password="&dbpass" path=prod);
      %put &sqlxmsg;

This is an example of pass-through code.

You can mask your password, a practice that we highly recommend, using PROC PWENCODE.

Slide 14: Using a LIBNAME to connect
- Recall that slide 13 showed the pass-through facility in SAS
  - most of the query is done on the database
- You can use a LIBNAME statement to connect instead of pass-through
  - the advantage of this method is that you are programming in SAS (using SAS functions and formats)
  - SAS determines which program (SAS or the RDB) will handle statements more efficiently
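As a brief aside on the password-masking recommendation from slide 13, a minimal sketch of PROC PWENCODE follows. The password shown is a placeholder, not a value from the presentation.

```sas
/* Encode a password once. The encoded string, prefixed with a tag such as */
/* {SAS002}, is written to the SAS log.                                    */
/* 'NotARealPassword' is a placeholder for illustration only.              */
proc pwencode in='NotARealPassword';
run;
```

The encoded string from the log can then be stored in the DBPASS macro variable in place of the clear-text password. Encoding hides the password from casual view; it does not replace restricting access to the program file itself.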

Slide 15: Using a LIBNAME to connect
Example using a libname statement:

    libname onco odbc dsn='Oncolog' schema=dbo;

1. onco: the name of the library
2. odbc: tells SAS that you are using an ODBC engine
3. dsn: use the name of the database that was used to set up the ODBC connection
NOTE: the schema= option is not always required

Slide 16: Seeing your data - Views
Once the view is created, you can open it from the EXPLORER tab in SAS and use it like a normal dataset.

Slide 17: Seeing your data - Views
Using the view columns in SAS EXPLORER

Slide 18: Seeing your data - Views
Double-click on a table to see the data.
NOTE: columns that identify personal information have been removed from this screen shot.

Slide 19: Other ways to view data
You may have software from the RDB vendor:
- TOAD (for Oracle)
- SQL Developer (for Oracle)
- SQL Server
- Teradata
All vendors may have some limited-function development software that allows:
- viewing data
- viewing the type of a column: char, num, date, etc.
- writing SQL queries

Slide 20: Sample view from SQL Developer
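Column types can also be inspected from within SAS itself. The sketch below assumes a small hypothetical dataset, work.demo, standing in for a database table. Note that SAS has only two variable types, character and numeric; a date is a numeric variable with a date format attached.

```sas
/* Hypothetical demo table in WORK; in practice this would be a table */
/* reached through your database library.                             */
data work.demo;
  length subjectID $4 sex $1;
  format visit_date date9.;
  subjectID = '0001';
  sex = 'M';
  age = 52;
  visit_date = '02MAY2014'd;
run;

/* Option 1: PROC CONTENTS lists each variable with its type, length and format. */
proc contents data=work.demo;
run;

/* Option 2: query the DICTIONARY.COLUMNS metadata table from PROC SQL. */
/* Library and member names are stored in upper case.                   */
proc sql;
  select name, type, length, format
  from dictionary.columns
  where libname = 'WORK' and memname = 'DEMO';
quit;
```

With a database library assigned, the same two steps should work by substituting the libref and table name, subject to what the database engine exposes.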

Slide 21: Syntax: Single table - 1 of 2

PROC SQL:

    proc sql;
      create table <new table> as
      select <column 1>, <column 2>, etc
      from <source table>
      where <condition>;
    quit;

DATA STEP:

    data <new table>;
      set <source table> (keep=<columns> where=(<condition>));
    run;

Example: Create a dataset (table) with men aged 50 to 74. Assume the source table is called demographics and contains the variables subjectID, age and sex.

Slide 22: Syntax: Single table - 2 of 2

PROC SQL:

    proc sql;
      create table men5074 as
      select subjectID, age
      from work.demographics
      where sex='M' and age between 50 and 74;
    quit;

DATA STEP:

    data men5074 (drop=sex);
      set work.demographics (keep=subjectid sex age where=(sex='M' and 50