sas etl tool.docx


Upload: usha85

Post on 19-Nov-2015


Using SAS/BASE to load CSV extracts

We will create a few SAS programs to load CSV files with dimension data and dynamically create additional dimension tables.

Lesson assumptions and objectives:
- We have two dimension extracts, products and customers (see screenshot below), plus a countries extract which will complement the customers dimension in further processing
- Two additional product attributes (prodtype and prodage) will be entered manually in the SAS program and will be stored in the fact table
- We will load the CSV extracts into the SAS Work library, and all files used for the processing will also be stored in the Work library
- We will generate a surrogate key column for the products and customers dimensions (idp and idc respectively)
- The appropriate informats will be applied while the extracts are loaded

Please refer to the comments included in the SAS program below.

* read customers definition from a csv file;
data customers;
  infile 'D:\business_scenario\src_customers.csv' delimiter = ',' MISSOVER firstobs=2;
  informat idc 8.0;
  informat CUST_ID $6.;
  informat CUST_NAME $30.;
  informat CUST_GROUP $20.;
  informat CUST_SEGMENT $20.;
  informat CUST_COUNTRY_ID $2.;
  input ID CUST_ID CUST_NAME CUST_GROUP CUST_SEGMENT CUST_COUNTRY_ID;
  idc = _n_; * this variable will be used as a surrogate key;
  drop ID;
run;

* read products definition from a csv file;
data products;
  infile 'D:\business_scenario\src_products.csv' delimiter = ',' MISSOVER firstobs=2;
  informat idp 8.0;
  informat PROD_ID $7.;
  informat PROD_NAME $30.;
  informat PROD_NAME_ENGLISH $30.;
  informat PROD_ZONE $15.;
  informat PROD_GROUP $15.;
  idp = _n_; * this variable will be used as a surrogate key;
  input PROD_ID PROD_NAME PROD_NAME_ENGLISH PROD_ZONE PROD_GROUP;
run;

* read countries definition from a csv file;
data countries;
  infile 'D:\business_scenario\src_countries.csv' delimiter = ',' DSD MISSOVER firstobs=2;
  informat COUNTRY_ID $2.;
  informat COUNTRY_TEXT $50.;
  informat REGION_TEXT $30.;
  input COUNTRY_ID COUNTRY_TEXT REGION_TEXT;
run;

* enter manually product types;
data prodtype;
  input type_id type_name $;
  cards;
1 Seeds
2 Plants
;

* enter manually product age types;
data prodage;
  input age_id age_name $;
  datalines;
1 0
2 1-2
3 3-5
4 6-10
5 11-20
6 21-100
;
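After the load it may be worth confirming that the informats were applied and that the surrogate keys are unique. The following is a minimal sketch, assuming the customers and products datasets created above exist in the Work library; the column aliases n_rows and n_keys are illustrative names only.

```sas
/* show variable types and lengths, then the first few loaded rows */
proc contents data=customers;
run;
proc print data=customers(obs=5);
run;

/* for a valid surrogate key the row count and the distinct key count match */
proc sql;
  select count(*) as n_rows, count(distinct idc) as n_keys
  from customers;
  select count(*) as n_rows, count(distinct idp) as n_keys
  from products;
quit;
```

If the two counts differ for either table, the key column should not be used for joins in the later lessons.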

Customers and Products CSV text files contents:

Populate Dimensions in a Fact table under SAS

Please find below an example of how to populate random dimensions in a fact table.

Lesson assumptions and objectives:
- The first few lines of the program form a section where we can set up parameters. For this tutorial we will generate data for the years 2005-2007, and we want 2000 observations in our fact table.
- The date will be shown in DD-MM-YYYY format (for instance 15-01-2007)
- Weekend sales should be lower than weekday sales. Although our business runs on Sundays, it has fewer sales on Sundays than on Saturdays.
- We randomly generate a date, a customer ID and a product ID for each data row. To make the model more interesting, we use both normal and uniform distributions to populate the random data.
- We also fill in the product type and product age variables
- At the end we will generate a simple SAS chart to check whether the data meets our expectations

Please refer to the comments included in the SAS program below.

* create a dataset in a work folder ;
data facts_date_ids;
  * initial parameters;
  dt_min='01jan2005'd;
  dt_max='31dec2007'd;
  cnt_cust=21;
  cnt_prod=22;
  prod_type_cnt=2;
  prod_age_cnt=6;
  obs_to_populate=2000;
  * format date to dd-mm-yyyy format ;
  format DT DDMMYYD10.;
  * loop to generate records;
  DO i=1 TO obs_to_populate;
    /* generate random date between min and max date */
    DT=dt_min+ranuni(0)*(dt_max-dt_min);
    /* make fewer sales on Sundays than on Saturdays, and fewer on weekends than on weekdays */
    if weekday(dt)=1 then dt=dt_max-ranuni(1)*(dt_max-dt_min);
    if weekday(dt)=7 then dt=dt_max-ranuni(2)*(dt_max-dt_min);
    if weekday(dt)=1 then dt=dt_max-ranuni(3)*(dt_max-dt_min);
    /* customer - draw from the absolute value of a normal random number (rannor), which favours low keys */
    idc = int(1+abs(rannor(123))*10);
    * if the populated value exceeds the limit, populate it again in the second half of the population using a uniform distribution (ranuni);
    if idc>cnt_cust then idc = int(cnt_cust/2+ranuni(123)*(cnt_cust/2));
    /* products - the same approach as for customers */
    idp = int(1+abs(rannor(3456))*20);
    * if the populated value exceeds the limit, populate it again in the second half of the population using a uniform distribution (ranuni);
    if idp>cnt_prod then idp = int(cnt_prod/2+ranuni(123)*(cnt_prod/2));
    /* prod type */
    prod_type = int(1+ranuni(0)*prod_type_cnt);
    /* product age */
    if prod_type=1 then prod_age=1; /* seeds always get age group 1, which stands for age 0 */
    else prod_age = int(2+ranuni(0)*(prod_age_cnt-1));
    output;
  END;
  /* keep only the relevant fields */
  keep dt idc idp prod_type prod_age;
run;

* quick look at the populated figures ;
PROC MEANS data=facts_date_ids min max MAXDEC=0;
  class prod_age;
  var prod_type;
RUN;

* looking at the data on a graph may be also interesting ;
proc gchart data=facts_date_ids;
  pie prod_age / discrete;
run;
quit;
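The three consecutive weekday checks in the program thin out weekend dates by redrawing them. As a rough worked estimate, assume that every draw lands on each of the seven days of the week with equal probability 1/7 (only approximately true for a three-year range) and that every redraw is a fresh uniform draw. The resulting shares are then:

```latex
\begin{align*}
\text{after check 1 (Sunday redrawn):}\quad
  & P(\text{Sun}) = \tfrac{1}{49},\;
    P(\text{Sat}) = \tfrac{8}{49},\;
    P(\text{each weekday}) = \tfrac{8}{49} \\
\text{after check 2 (Saturday redrawn):}\quad
  & P(\text{Sun}) = \tfrac{15}{343},\;
    P(\text{Sat}) = \tfrac{8}{343},\;
    P(\text{each weekday}) = \tfrac{64}{343} \\
\text{after check 3 (Sunday redrawn):}\quad
  & P(\text{Sun}) = \tfrac{15}{2401} \approx 0.6\%,\;
    P(\text{Sat}) = \tfrac{71}{2401} \approx 3.0\%,\;
    P(\text{each weekday}) = \tfrac{463}{2401} \approx 19.3\%
\end{align*}
```

Under these assumptions, 2000 generated rows would contain roughly 12 or 13 Sunday rows and about 59 Saturday rows, which matches the intent of having fewer sales on Sundays than on Saturdays and fewer on weekends than on weekdays.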

Use PROC MEANS and PROC GCHART to check the newly generated data:

Generate measures in a Fact table under SAS

Please find below an example of how to generate measures in a fact table and perform some calculations on those figures. The measures will be generated randomly; however, they will follow the business rules described in the business scenario in the introduction to the tutorial.

Lesson assumptions and objectives:
- We already have a table filled with the dimensional data (facts_date_ids)
- We will use a quantity modifier variable which will help us stick to the business requirements and the business scenario. The modifier determines the range of quantities sold based on product and customer data. For example, the value of the modifier will be higher for wholesale than for retail sales, and lower for grown-up plants than for seedlings.
- Price is calculated from the product surrogate key and age: the higher the product key and the older the product, the more it is worth.
- Revenue is calculated without any rebates or discounts at this stage.
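The quantity modifier and price rules described above can be tried on a single row before running the full program. The sketch below mirrors the formulas used in this lesson's program; the input values (a plant in age group 3, product key 10, a wholesale nursery customer) are hypothetical and not taken from the tutorial data.

```sas
/* hypothetical single row, used only to illustrate the rules */
data _null_;
  prod_type    = 2;            /* plant */
  prod_age     = 3;            /* age group 3 (3-5) */
  idp          = 10;           /* product surrogate key */
  cust_segment = 'Wholesale';
  cust_group   = 'Nurseries';

  q_modifier = 1;
  if prod_type = 1 then q_modifier = q_modifier * 3;                /* seeds */
  if cust_segment = 'Wholesale' then q_modifier = q_modifier * 10;  /* wholesalers */
  if cust_group = 'Nurseries' then q_modifier = q_modifier * 20;    /* nurseries */
  if prod_age > 1 then q_modifier = 10 * q_modifier * (1 / (prod_age ** 2));

  price = (idp ** 0.3) * (prod_age ** 4) / 5;

  put q_modifier= price=;      /* written to the SAS log */
run;
```

With these inputs the log should show a q_modifier of about 222.2 and a price of about 32.32. Changing prod_age to a higher group shows how quickly the modifier shrinks while the price grows.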

Please refer to the comments included in the SAS program below.

/* facts_date_ids holds only keys, so bring in the customer attributes used by the rules below */
proc sql;
  create table facts_cust as
  select f.*, c.cust_segment, c.cust_group
  from facts_date_ids f, customers c
  where f.idc=c.idc;
quit;

* temporary fact dataset;
data facts_measures;
  set facts_cust;
  * format values;
  format PRICE 8.2;
  format QUANTITY 8.0;
  format REVENUE 8.2;
  /* the quantity modifier applies the model assumptions regarding differences */
  /* between the amounts of products sold to retailers vs wholesalers, etc. */
  q_modifier=1;
  /* seeds are sold in higher quantities than plants */
  if prod_type=1 then q_modifier=q_modifier*3;
  /* wholesalers buy more than retailers */
  if cust_segment='Wholesale' then q_modifier=q_modifier*10;
  /* nurseries buy more than other wholesalers */
  if cust_group='Nurseries' then q_modifier=q_modifier*20;
  /* younger plants are bought in bigger quantities */
  if prod_age>1 then do;
    q_modifier=10*q_modifier*(1/(prod_age**2));
  end;
  * calculate FINAL RANDOM QUANTITY with the use of the modifier ;
  QUANTITY=int(1+q_modifier*abs(rannor(456)));
  /* seeds are packed by 10 */
  if prod_type=1 then QUANTITY=QUANTITY*10;
  /* calculate FINAL PRICE */
  /* price is low for seeds (prod_age=1) and gets higher with age */
  PRICE = (idp**0.3)*((prod_age)**4)/5;
  /* calculate REVENUE */
  REVENUE=price*quantity;
run;

/* test the newly generated data */
PROC MEANS data=facts_measures mean min max MAXDEC=2 nonobs;
  class cust_group prod_age;
  var quantity;
RUN;

Create sales fact table in a star schema Datawarehouse

In this lesson we will create a fact table in a star schema data warehouse architecture. We will also produce a statistical summary of the newly generated data and perform some validations and checks. As the last task in this lesson we will generate CSV extracts with the dimension and fact data.

Lesson assumptions and objectives:
- To create the final sales fact table, we need information from 5 tables.
- The sales figures will be stored at day level
- The SAS MERGE statement could be used for this, but we will use PROC SQL as it is simpler. Here you can see the power of SAS processing, where different data handling techniques can be used together.
- The fact table will have 8 columns: 5 dimensions and 3 measures.
- We will also create CSV files with fact and dimension data, which will serve as input for the Cognos PowerPlay model or any other reporting tool. The data can also be analyzed in SAS analysis tools.

Please refer to the comments included in the SAS program below.

/* the CognosBI library must already be assigned with a LIBNAME statement */
/* we can use the SQL procedure to join the tables together using inner joins and create a final fact table */
proc sql;
  CREATE TABLE CognosBI.Sales_Facts AS
  SELECT fm.DT as DT,
         cus.cust_id as CUST_ID,
         prd.PROD_ID as PROD_ID,
         pa.age_name as PROD_AGE,
         pt.type_name as PROD_TYPE,
         fm.PRICE,
         fm.QUANTITY,
         fm.REVENUE
  FROM facts_measures fm, Customers cus, Products prd, ProdAge pa, ProdType pt
  WHERE fm.idc=cus.idc
    AND fm.idp=prd.idp
    AND fm.prod_age=pa.age_id
    AND fm.prod_type=pt.type_id
  ORDER BY DT;
quit;

/* we also need a customer table joined with countries */
proc sql;
  CREATE TABLE CognosBI.Dim_Customers AS
  SELECT cus.CUST_ID,
         cus.CUST_NAME,
         cus.CUST_GROUP,
         cus.CUST_SEGMENT,
         cus.CUST_COUNTRY_ID,
         reg.COUNTRY_TEXT,
         reg.REGION_TEXT
  FROM Customers cus, Countries reg
  WHERE cus.CUST_COUNTRY_ID=reg.COUNTRY_ID;
quit;

/* products dimension table can be copied easily using a SAS dataset */
data CognosBI.Dim_Products;
  set products(drop=idp); * drop the idp column which will not be used ;
run;

/* extract the dimensions and facts tables into csv files which will be used for processing */
proc export data=CognosBI.Sales_Facts
  outfile='D:\business_scenario\sales-cognosbi.csv'
  dbms=csv replace;
run;
proc export data=CognosBI.Dim_Customers
  outfile='D:\business_scenario\dim_customers.csv'
  dbms=csv replace;
run;
proc export data=CognosBI.Dim_Products
  outfile='D:\business_scenario\dim_products.csv'
  dbms=csv replace;
run;
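For comparison, the customers and countries join could also be written with the MERGE statement mentioned in the lesson objectives. This is a minimal sketch, assuming the customers and countries datasets from the first lesson; the dataset names cust_sorted, ctry_sorted and dim_customers_merge are illustrative and are not part of the original programs.

```sas
/* MERGE needs both inputs sorted by the join key */
proc sort data=customers out=cust_sorted;
  by cust_country_id;
run;
proc sort data=countries out=ctry_sorted;
  by country_id;
run;

data dim_customers_merge;
  /* rename the key so that both inputs share the same BY variable */
  merge cust_sorted(in=in_cust)
        ctry_sorted(in=in_ctry rename=(country_id=cust_country_id));
  by cust_country_id;
  if in_cust and in_ctry;   /* keep matching rows only, like the inner join */
  drop idc;                 /* the surrogate key is not part of the dimension extract */
run;
```

The MERGE version needs two sorts and a commonly named BY variable, which illustrates why the PROC SQL version is the shorter choice here.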

Randomly generate star schema fact table in SAS:

Create costs fact tables

In this lesson we will create fact tables with costs. Costs are provided on a monthly basis (variable cost) and on a yearly basis (fixed cost). Our sales data is stored at day level, which makes this data model, with several fact tables at different levels of detail, an example of a Fact Constellation Schema.

The result of this lesson will be two CSV extracts with the cost data.

Lesson assumptions and objectives:
- The cost data will be randomly generated, using total revenue as the base variable
- Costs are divided into fixed and variable costs and allocated at different levels of date detail
- We will create a separate SAS table and CSV extract for each of the two cost types

Please refer to the comments included in the SAS program below.

* CALCULATE FIXED COST ;
* create a temporary table with total revenue which will be an input to the fixed cost calculation;
proc sql;
  create table fixcost as
  select min(year(dt)) as yr, sum(revenue) as tot_revenue
  from cognosbi.sales_facts;
quit;

/* populate the fixed costs, which start at around 40 percent of average yearly sales and grow each year */
data cognosbi.fixcosts(keep=yr fixcost);
  set fixcost;
  obs_to_populate=3; * we will populate three years of data;
  format fixcost 8.2;
  * fixed cost base record for the first year, being 40 percent of the average yearly revenue;
  base_fixcost = 0.4*(tot_revenue/obs_to_populate); * total revenue divided by number of years ;
  fixcost=round(base_fixcost,500); * round the result to the nearest 500 ;
  output; * output the first base record ;
  * populate data for the following years ;
  DO i=2 TO obs_to_populate;
    yr+1; /* increase year */
    * calculate a random cost increase ranging from 0 to 15 percent ;
    costincr=1+0.15*ranuni(i);
    * calculate the final fixed cost ;
    fixcost=round(fixcost*costincr,500);
    OUTPUT; * write to the output table;
  END;
run;

* CALCULATE VARIABLE COST ;
* create a temporary varcost table ;
proc sql;
  create table varcost as
  select month(dt) as mth, year(dt) as yr, sum(revenue) as mth_revenue
  from cognosbi.sales_facts
  group by mth, yr;
quit;

/* generate random variable costs which will be around 20 percent of monthly sales */
data cognosbi.varcosts(keep=newdt varcost);
  set varcost;
  * format the date to a mm-yyyy format;
  format newdt mmyyd7.;
  format varcost 8.2;
  * create a date which is the first day of each month ;
  newdt=mdy(mth,1,yr);
  * delta which is based on a normal distribution random number generator (can be positive or negative) ;
  delta = rannor(0)*mth_revenue*0.05;
  * calculate the final variable cost ;
  varcost=0.2*mth_revenue+delta;
run;

/* generate costs extracts */
proc export data=CognosBI.fixcosts
  outfile='D:\business_scenario\f_fixcost.csv'
  dbms=csv replace;
run;
proc export data=CognosBI.varcosts
  outfile='D:\business_scenario\f_varcost.csv'
  dbms=csv replace;
run;
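The cost rules in the program can be restated as formulas. The notation is introduced here for illustration only and does not appear in the original: R is the total revenue over the three years, R_m the revenue of month m, F_k the fixed cost of year k, V_m the variable cost of month m, u_k a uniform random number between 0 and 1 (ranuni), z_m a standard normal random number (rannor), and round_500 stands for rounding to the nearest 500.

```latex
\begin{align*}
F_1 &= \operatorname{round}_{500}\!\left(0.4 \cdot \frac{R}{3}\right) \\
F_k &= \operatorname{round}_{500}\!\bigl(F_{k-1}\,(1 + 0.15\,u_k)\bigr), \qquad k = 2, 3 \\
V_m &= 0.2\,R_m + 0.05\,R_m\,z_m
\end{align*}
```

Read this way, each year's fixed cost grows by between 0 and 15 percent before rounding, and the variable cost of a month averages 20 percent of that month's revenue with a standard deviation of 5 percent of it.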

Costs fact tables generated in SAS:

Implement ETL Process in SAS

The program below runs the whole process in sequence, which may be considered a representation of an ETL process in SAS. The ETL process could also be set up in SAS ETL Studio or SAS Warehouse Administrator, which would be a far more sophisticated solution.

Lesson assumptions and objectives:
- We are ready to execute the complete SAS program flow in sequence
- After running the loading, transformation and extraction programs, we will have a look at the data in SAS using PROC MEANS.
- We will check the sum, average, min and max values of all the measures and make sure that the sales data corresponds to our business scenario and follows our business assumptions

Please refer to the comments included in the SAS program below.

* run all programs in an ETL sequence ;
%include 'D:\business_scenario\1-read_dimensions.sas';
%include 'D:\business_scenario\2-gen_facts_date_ids.sas';
%include 'D:\business_scenario\3-gen_facts_measures.sas';
%include 'D:\business_scenario\4-create-fact-table.sas';
%include 'D:\business_scenario\5-gen-costs.sas';

* Use PROC MEANS to analyze populated figures ;
* We will check sum, average, min and max values of all measures ;
* the MAXDEC parameter indicates that we want to limit numbers to 2 decimal places ;
PROC MEANS data=CognosBI.Sales_Facts mean min max sum MAXDEC=2;
  class prod_type prod_age;
  var price revenue quantity;
RUN;
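Beyond the PROC MEANS summary, a row-level consistency check can confirm that the revenue rule survived the whole flow. This is a minimal sketch, assuming CognosBI.Sales_Facts was created as in the earlier lesson; the counter name bad_rows and the tolerance value are illustrative choices.

```sas
data _null_;
  set CognosBI.Sales_Facts end=last;
  /* count rows where REVENUE is not PRICE times QUANTITY (small tolerance for floating point) */
  if abs(revenue - price*quantity) > 1e-6 then bad_rows + 1;
  /* after the last row, write the result to the SAS log */
  if last then put 'Rows failing the revenue check: ' bad_rows;
run;
```

Since revenue is calculated without rebates or discounts at this stage, any non-zero count would point to a problem in one of the included programs.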

Summary and statistics for newly generated measures: