
Page 1: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

How to have the monitoring of the days on the data files of a Data Warehouse

Recipes of Data Warehouse and Business Intelligence

Are you the right one? Do you have what I expect?

Have you lost some piece?

DATA FILE

Page 2: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

• In this article, we focus on the management of the loading day of the data file, the reference day of the data, and the expected number of rows. These issues have already been covered briefly in some of my previous articles published on SlideShare and on my blog. Now we will see the practical application.

• As a real case, we will use, as an example, the data file of the MTF markets (Multilateral Trading Facilities). A "row" file has been associated with the data file; it contains the number of rows expected in the data file itself.

• The control file, created by hand for this purpose, is composed of three lines:

#MTF CONTROL FILE OF 20160314
ROWS = 160
#END OF MTF CONTROL FILE OF 20160314

• We assume that the data file arrives every working day, and that the reference day is the previous working day.

• The reference day is specified in the file name, but we must be careful, because the feeding system sets as reference the day of production of the data file, and not the previous working day.

The use case

Page 3: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

• Based on the information mentioned above, to get full control of the data file loading, the ETL system should provide all the information necessary to fulfill the following requirements.

• We must have a clear vision of the characteristics of the data file, both general and of a purely technical nature. In particular, those linked to its name, its structure, the way the reference day is defined, and the structure of the control file (if present).

• So, we will define the temporal characteristics of the data file by using a code that represents its management.

The control requirements

Page 4: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

• For convenience, I summarize the ways in which the feeding system can tell me the reference day.

The control requirements

Where is the reference day of the data?

• Inside the data file: in a column of the data file, in the heading of the data file, or in the tail of the data file.

• Outside the data file: in the name of the data file, or missing (in which case the system date is assumed).

Page 5: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

• We must have a clear vision of the internal structure of the data file, i.e. the columns that constitute it. For each column, as much metadata as possible must be present.

• Both static metadata, such as the type or the length, and dynamic metadata, such as the presence of a domain of values, or whether the column is part of the unique key.
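To make the idea of per-column metadata concrete, here is a purely illustrative table sketch; the real MEF configuration of the fields lives in the ..\cft\mtf.csv file, and none of these column names are taken from MEF.

-- Purely illustrative per-column metadata structure (not the MEF one).
CREATE TABLE col_metadata_example (
  io_cod     VARCHAR2(10),   -- data file identifier (e.g. MTF)
  col_name   VARCHAR2(30),   -- column name in the data file
  col_type   VARCHAR2(20),   -- static metadata: data type
  col_len    NUMBER,         -- static metadata: length
  domain_txt VARCHAR2(200),  -- dynamic metadata: domain of allowed values, if any
  key_flg    CHAR(1)         -- dynamic metadata: 'Y' if the column is part of the unique key
);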

The control requirements

Page 6: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The control requirements

• We must have a calendar table that, for each calendar day, tells us, simply by duplicating the day, whether we expect the arrival of the data file and what the expected reference day inside the data file of that day is.

• If the data file contains more than one day, we need to know the range of days that we expect.
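As a purely illustrative example of the kind of lookup such a calendar table enables, the query below asks, for today, whether the MTF file is expected and which reference day should be inside it. The IODAY_CFT column names used here (IO_COD, DAY_YMD, DR_YMD) are assumptions; only FR_YMD is named later in this article.

-- Illustrative lookup on the calendar/configuration table (column names are assumptions).
SELECT day_ymd,                 -- the calendar day
       fr_ymd,                  -- not null if the data file is expected on that day
       dr_ymd                   -- the reference day expected inside the data file
  FROM ioday_cft
 WHERE io_cod  = 'MTF'
   AND day_ymd = TO_CHAR(SYSDATE, 'YYYYMMDD');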

Page 7: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The control requirements

• We need to know the final outcome of the processing: the final state and the time taken. If the upload had problems, we need to know the error produced and the programming module that generated it.

• If the outcome is negative, we have to know exactly why it is in error. For example, if the consistency check has failed, we need to know at what point it occurred.

Page 8: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The control requirements

• We need to know the final outcome of the controls on the loading day and the reference day.

• To get the final outcome of the controls, we have to think about implementing a control logic similar to that shown in the next figure.

• In dark green, the definitely correct situations; in red, the alert situations; in light green, the ones that are presumably correct but require attention.

Page 9: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The control requirements

The control logic of the figure, as a decision tree:

Data file has arrived? = yes
• It had to arrive? = yes
  - Expected day = reference day? yes → 1 - OK (arrived and right day)
  - Expected day = reference day? no → 2 - NOT OK (arrived but wrong day)
• It had to arrive? = no
  - Expected day = reference day? yes → 3 - OK (unexpected file)
  - Expected day = reference day? no → 4 - NOT OK (unexpected file and wrong day)
• It had to arrive? = maybe
  - Expected day = reference day? yes → 5 - OK (maybe file)
  - Expected day = reference day? no → 6 - NOT OK (maybe file and wrong day)

Data file has arrived? = no
• It had to arrive? = yes → 7 - NOT OK (missing file)
• It had to arrive? = no → 8 - OK (no file to load)
• It had to arrive? = maybe → 9 - OK (maybe file)
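The same logic can be expressed as a single CASE expression. The sketch below only illustrates how a view (such as the IODAY_CFV view described later) could classify the nine cases; the flag names (arrived_flg, expected_cod, day_match_flg) are assumptions, not MEF column names.

-- Illustrative CASE expression for the nine outcomes above (flag names are assumptions).
SELECT CASE
         WHEN arrived_flg = 'Y' AND expected_cod = 'YES'   AND day_match_flg = 'Y' THEN '1 - OK (arrived and right day)'
         WHEN arrived_flg = 'Y' AND expected_cod = 'YES'   AND day_match_flg = 'N' THEN '2 - NOT OK (arrived but wrong day)'
         WHEN arrived_flg = 'Y' AND expected_cod = 'NO'    AND day_match_flg = 'Y' THEN '3 - OK (unexpected file)'
         WHEN arrived_flg = 'Y' AND expected_cod = 'NO'    AND day_match_flg = 'N' THEN '4 - NOT OK (unexpected file and wrong day)'
         WHEN arrived_flg = 'Y' AND expected_cod = 'MAYBE' AND day_match_flg = 'Y' THEN '5 - OK (maybe file)'
         WHEN arrived_flg = 'Y' AND expected_cod = 'MAYBE' AND day_match_flg = 'N' THEN '6 - NOT OK (maybe file and wrong day)'
         WHEN arrived_flg = 'N' AND expected_cod = 'YES'   THEN '7 - NOT OK (missing file)'
         WHEN arrived_flg = 'N' AND expected_cod = 'NO'    THEN '8 - OK (no file to load)'
         WHEN arrived_flg = 'N' AND expected_cod = 'MAYBE' THEN '9 - OK (maybe file)'
       END AS outcome
  FROM (SELECT 'Y' AS arrived_flg, 'YES' AS expected_cod, 'N' AS day_match_flg FROM dual);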

Page 10: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The control requirements

• We must receive the result of the processing via e-mail.

• Using the Micro ETL Foundation we can handle this situation and its control in a few steps.

MEF: Open the link https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k, go to the Mef_v2 folder and follow the instructions of the readme file.

The data file is in the folder ..\dat and is called mtf_export_20160314.csv. The control file with the expected number of rows is called mtf_export_20160314.row; it is also in ..\dat. The file that configures the data file fields is located in the folder ..\cft and is called mtf.csv.

Page 11: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The configuration of the data and control file

• The first step is to insert into a configuration table, which we will call IO_CFT for brevity, all the information that we know about the features of the data file that we load. For this case, you also need to enter in the IO_CFT table the information relating to the control file.

• The second step is to insert in the IO_CFT table the information relative to the expected day of arrival of the data file. We must define a code, which we will call FR_COD (File Reference Code), behind which there will be the load logic of a second configuration table that we will call IODAY_CFT. The FR_COD code represents the arrival frequency. For the moment, I have defined some commonly used values:

• AD = Every day. It means that the data file must arrive every day. So, in the IODAY_CFT table, all the days will be set.

• AWD = All working days. It means that the data file must only arrive on working days. So all holidays, plus Saturdays and Sundays, will be null.

• ? = I do not know when it arrives; it is variable. Typical of monthly flows for which no one knows precisely when they become available.

• Based on the FR_COD code, the IODAY_CFT table will be loaded, by setting the presence of the expected day in the FR_YMD field (a sketch of this load is shown below).
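As an illustration of how such a load could look, here is a minimal SQL sketch for the AWD case. It is not the MEF implementation (that is done by mef_sta_build.p_objday_cft), and the IO_COD and DAY_YMD column names are assumptions; only FR_YMD is cited above.

-- Minimal sketch of an AWD-style load of IODAY_CFT (illustrative only).
-- Column names other than FR_YMD are assumptions; holidays are ignored here.
INSERT INTO ioday_cft (io_cod, day_ymd, fr_ymd)
SELECT 'MTF',
       TO_CHAR(d.cal_day, 'YYYYMMDD') AS day_ymd,
       CASE
         WHEN TO_CHAR(d.cal_day, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH') NOT IN ('SAT', 'SUN')
         THEN TO_CHAR(d.cal_day, 'YYYYMMDD')   -- AWD: arrival expected on working days only
       END AS fr_ymd
  FROM (SELECT DATE '2016-01-01' + LEVEL - 1 AS cal_day
          FROM dual
       CONNECT BY LEVEL <= 366) d;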

Page 12: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

Reference day configuration

• The third step is to insert in the IO_CFT table the information relating to the expected reference day.

• The DR_COD code must indicate what the reference day of the data in the data file should be. Remember that the reference day must be present or implied. The same logic applied to the FR_COD field also applies to the DR_COD field; it will serve to set the IODAY_CFT table. For the moment, I have defined some commonly used values:

• 0 = the reference date coincides with the current day.

• 1 = the reference date coincides with the day before, that is, the current day -1.

• 1W = indicates the first preceding business day.

• The configuration of the IODAY_CFT table occurs only once, during the configuration process of the data file. Afterwards, you no longer need to change it.

• Note that the use of the codes is just a way to facilitate the setting of the IODAY_CFT table quickly. Nothing prevents you from modifying the table manually or with ad-hoc SQL.

Page 13: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

Configuration of the correction factor

• The OFF_COD code present in IO_CFT indicates the correction factor to be applied to the reference day indicated by the feeding system. OFF_COD does not take part in the control, but acts as a corrector of the day at run-time. For the moment, I have defined some commonly used codes:

• 0 = the reference day coincides with the day indicated by the feeding system.

• 1 = the reference day coincides with the day before, that is, the current day -1.

• 1W = the reference day coincides with the previous working day.
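To make the 1W correction concrete, here is a minimal sketch of how the previous working day could be computed from the day indicated by the feeding system. It ignores holidays and is not the mef_sta.f_off_cod implementation.

-- Minimal sketch of the 1W offset (previous working day), holidays not considered.
-- 20160314 (a Monday, taken from the sample file name) is corrected to Friday 20160311.
SELECT TO_CHAR(
         CASE TO_CHAR(ref_day, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH')
           WHEN 'MON' THEN ref_day - 3
           WHEN 'SUN' THEN ref_day - 2
           ELSE            ref_day - 1
         END, 'YYYYMMDD') AS corrected_ref_day
  FROM (SELECT TO_DATE('20160314', 'YYYYMMDD') AS ref_day FROM dual);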

• The FROM_DR_YMD and TO_DR_YMD fields have the same meaning as the FR_COD field, but allow you to identify a range of possible reference days. For the moment, only one code has been defined:

• PM = the previous month of the current calendar day.

MEF:

The data file is in the folder ..\dat and is called mtf_export_20160314.csv. The control file with the expected number of rows is called mtf_export_20160314.row; it is also in ..\dat. The file that configures the data file field structure is located in the folder ..\cft and is called mtf.csv. The configuration file of the data file is called io_mtf.txt and is under the folder ..\cft. It has the following settings:

Page 14: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The configuration file

IO_COD: MTF (file identifier)
IO_DEB: Multilateral Trading Facilities (file description)
TYPE_COD: FIN (file type - input file)
SEC_COD: ESM (feeding system: ESMA)
FRQ_COD: D (frequency - daily)
FILE_LIKE_TXT: mtf_export%.csv (generic name of the file without day)
FILE_EXT_TXT: mtf_export_20160314.csv (name of the sample data file)
HOST_NC: ., (priority on the decimal point)
HEAD_CNT: 1 (number of rows in header)
FOO_CNT: 0 (number of rows in tail)
SEP_TXT: , (separator symbol if csv)
START_NUM: 12 (starting character of the day in the name)
SIZE_NUM: 8 (size of the day)
RROW_NUM: 2 (row of the control file in which there is the file rows number)
RSTART_NUM: 8 (where the number of rows begins)
RSIZE_NUM: 6 (size of the number)
MASK_TXT: YYYYMMDD (format of the day)
FR_COD: AWD (file reference code)
DR_COD: 1W (day reference code)
OFF_COD: 1W (offset on day reference)
RCF_LIKE_TXT: mtf_export%.row (generic name of the control file without day)
RCF_EXT_TXT: mtf_export_20160314.row (name of the sample control file)
FTB_TXT: NEWLINE (indicator of the row end for the Oracle external table)
TRUNC_COD: 1 (indicates whether the staging table should be truncated before loading)
NOTE_IO_COD: MTF (presence of a notes file)
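As an illustration of how some of these settings drive the controls, the sketch below uses START_NUM, SIZE_NUM and MASK_TXT to extract the reference day from the sample file name, and RROW_NUM, RSTART_NUM and RSIZE_NUM to read the expected row count from line 2 of the sample control file. The hard-coded literals stand in for values that MEF reads from the configuration.

-- Illustrative use of the configured offsets; not the MEF code itself.
-- START_NUM=12, SIZE_NUM=8, MASK_TXT=YYYYMMDD applied to the file name;
-- RSTART_NUM=8, RSIZE_NUM=6 applied to row 2 of the control file ('ROWS = 160').
SELECT TO_DATE(SUBSTR('mtf_export_20160314.csv', 12, 8), 'YYYYMMDD') AS file_ref_day,
       TO_NUMBER(TRIM(SUBSTR('ROWS = 160', 8, 6)))                  AS expected_rows
  FROM dual;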

Page 15: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The configuration file

MEF:

The DR_COD code is managed by the mef_sta_build.p_dr_cod function.
The FR_COD code is managed by the mef_sta_build.p_fr_cod function.
The OFF_COD code is managed by the mef_sta.f_off_cod function. See further details in Recipe 12 on SlideShare.
The functions that handle the day range are mef_sta_build.p_from_dr_cod and mef_sta_build.p_to_dr_cod.
In this way, by changing the functions, we can define other codes. The mef_sta_build.p_objday_cft will load the IODAY_CFT table.

The complete configuration of the data file is done by launching the procedure

SQL> @sta_conf_io MTF

Page 16: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

The data file loading

• The process of loading the data file must insert into a log table the information related to the elaboration day and to the reference day received from the feeding system.

MEF:
SQL> exec mef_job.p_run('sta_esm_mtf');

• Comparing, at the end of loading, what is configured with what is loaded, we can infer a final outcome of the process. This comparison may be displayed by means of a view which we will call IODAY_CFV.

• The logic with which the view works was summarized in a previous figure. On the basis of this outcome, an intervention strategy must be agreed upon.

• In our example, launched on a working day, we see that there is a problem related to the reference day.

• There is also another problem to be investigated: the number of rows declared in the control file is different from the number of rows loaded.
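As a sketch of this second check, the query below compares the row count declared in the control file with the rows actually loaded. The staging table name sta_esm_mtf is taken from the job name above and is an assumption; the declared count is hard-coded for illustration.

-- Illustrative row-count consistency check (staging table name is an assumption).
SELECT c.declared_rows,
       s.loaded_rows,
       CASE WHEN c.declared_rows = s.loaded_rows THEN 'OK' ELSE 'NOT OK' END AS row_check
  FROM (SELECT 160 AS declared_rows FROM dual) c,
       (SELECT COUNT(*) AS loaded_rows FROM sta_esm_mtf) s;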

Page 17: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

Conclusion

• Whatever way we implement an ETL solution, the important point to emphasize is that we need to know, beforehand, the time characteristics of the data file that we will load.

• For each calendar day, we must be clear about what we expect to receive on that day and, for any given data file, what reference day we expect to find inside it.

• There can be no doubt or ambiguity: this is information that we need to know in advance and must configure. After the loading of the Staging Area, only the comparison between what we expected to receive and what we actually received will allow us to evaluate the correctness of the loaded data.

• Just remember that this correctness check is a priority, it is the first check, and it refers only to the two time components of the data. Only if these checks are positive will it make sense to continue with the other quality controls.

Page 18: Recipe 16 of Data Warehouse and Business Intelligence - The monitoring of the days on the data files of a Data Warehouse

References

On SlideShare: the series Recipes of Data Warehouse and Business Intelligence.

Blog: http://microetlfoundation.blogspot.it and http://massimocenci.blogspot.it/

Micro ETL Foundation free source at: https://drive.google.com/open?id=0B2dQ0EtjqAOTQzZSaUlyUmxpT1k (last version v2).

Email: [email protected]