
Upload: taqiyuddin-abdul-rahman

Post on 17-Dec-2015


National Enterprise Wide Statistical System (NEWSS) Data Migration

Azizah Bt Hashim, Nur Hurriyatul Huda Bt Abdullah Sani

Abstract

This paper examines the impact of data migration in the National Enterprise Wide Statistical System (NEWSS) on the Department of Statistics Malaysia (DOSM). The data are drawn from NEWSS Phase I and II, including data from the Economic Census 2005 and 2010. The ETL model is a comprehensive method for studying NEWSS data migration because it allows us to investigate the effect on the data across sectors in the department. The general finding of this paper is that the contribution of data migration activities to the operation of the NEWSS system increased significantly in 2014 compared with 2011. This is in line with the Department's mission and objective to produce National Statistics with integrity and reliability through the use of the best technology, and to improve and strengthen statistical services and the delivery system.

Keywords: Database migration, Data Migration, ETL, Objective and Mission

1. Introduction

With rapidly growing business requirements in DOSM and new enterprise-wide application integration, organizations reach a stage where they have to move from working with separate databases on multiple platforms to a single, integrated one. Migration also happens when an organization realizes that its existing systems have performance and scalability limitations and cannot cater to its ever-expanding business needs.

Data migration is the process of transferring data between storage types, formats, or computer systems. It is required when organizations or individuals change computer systems or upgrade to new systems, or when systems merge. Data migration is usually performed programmatically to achieve an automated migration.

Figure 1: Data migration flow in DOSM

There is a difference between data migration and database migration, although database migration also encompasses data migration. Database migration essentially means moving the data and converting the various other structures and objects associated with the database, including the schema and the applications associated with the current system, to a different technology or platform. Database migration is one of the most common, yet major, tasks in any application migration. Examples of what database migration covers are:

Business Logic - Stored Procedures, Triggers, Packages, Functions

Schema - Tables, Views, Synonyms, Sequences, Indexes

Physical Data Security, Users, Roles, Privileges

Database dependency of applications associated with the database

Data migration is simply the movement of data from one database (or file system) or platform to another. This may include extracting the data, cleansing it and loading it into the target database. For example, when a new application is developed, it needs the existing data in order to operate. In this case only the data is moved from the source database to the database used by the new application.

Put simply, database migration refers to a shift from one type of database system to an entirely new type of database system, or to a database system with entirely new features and functionality. Hence data migration is a subset of the activities carried out in a database migration, though data migration may also be undertaken independently.

An interesting question is why it is necessary to move to another database while the existing systems are still running on the current one. The reasons why data migration is important in DOSM are:

1. Avoid business failure

2. Improve corporate performance and deliver competitive advantage

3. Efficient and effective business processes (centralized database)

4. Measurable and accurate view of data

5. Perceive better value in the newer system in terms of standardization of operational field work and data entry

2. Literature review

There are a number of studies on best practices for data migration, for example Data migration: Methodologies for assessing, planning, moving and validating data migration by IBM Global Technology Services (October 2009) and the study by NetApp Global Services (January 2006). Meanwhile, a study on Database Migration Approach & Planning was done by Keshav Tripathy, Pragjnyajeet Mohanty and Biraja Prasad Nath (2002).

In his study Introduction on Patterns for Data Migration Projects (March 17, 2011), Martin Wagner concludes that the quality constraints on the data in the old system may be lower than the constraints in the target system. Inconsistent or missing data entries that the legacy system somehow copes with (or ignores) might cause severe problems in the target system. In addition, the data migration itself might corrupt the data in a way that is not visible to the software developers but only to business users.

NetApp Global Services (January 2006), in Data Migration Best Practices, state that for IT managers, data migration has become one of the most routine, and challenging, facts of life. With the increase in the percentage of mission-critical data and the proportionate increase in data availability demands, downtime, with its huge impact on a company's financial bottom line, becomes unacceptable. In addition, business, technical and operational requirements impose challenging restrictions on the migration process itself. Resource demands (staff, CPU cycles, and bandwidth) and risks (application downtime, performance impact to production environments, technical incompatibilities, and data corruption/loss) make migration one of IT's biggest challenges. Since the majority of storage systems purchased by customers is used to store existing, rather than new, data, getting these new systems production-ready requires that data be copied or moved from the old system being replaced to the new system being deployed. Whether the migration is performed by internal IT or an external services provider, the migration methodology is the same.

On the other hand, IBM Global Technology Services (October 2009) mention that when systems must be taken down for migration, business operations can be seriously affected. A key way to minimize the business impact of data migration is to use best practices that incorporate planning, technology implementation and validation. Any change in the storage infrastructure, whether it is a technology refresh, consolidation, relocation or storage optimization, requires an organization to migrate data. There are a variety of software products that can be used to migrate data, including volume-management products, host- or array-based replication products and relocation utilities, as well as custom-developed scripts. Each of these has strengths and weaknesses surrounding performance, operating system support, storage-vendor platform support and whether or not application downtime is required to migrate the data. Some of these products enable online migration of data, so applications don't need to be taken offline during the migration process. A subset of these provides nondisruptive migration, which means that applications not only remain online, but also that application processing continues without interruption or significant performance delays. Therefore, IT organizations should carefully explore software options. Specific requirements can help determine the best software technology to use for each migration.

In addition, Keshav Tripathy, Pragjnyajeet Mohanty and Biraja Prasad Nath, in Database Migration Approach & Planning, state that database migration consists of three major components:

Schema Migration - This consists of mapping and migrating the source schema to the target schema. For this, the schema needs to be extracted from the source system and its equivalent replicated in the target system.

Data Migration - This is the part where the data is extracted from the source database. It is then checked for consistency and accuracy, and cleansed if necessary. Finally it is loaded into the target system.

Application Migration - This consists of changing the database-dependent areas of the application (function calls, data access methods, etc.) so that the input/output behavior of the converted application with the target database is identical to that of the original application with the source database.

However, this paper focuses only on the methodology of data migration, covering the data for NEWSS Phase I and II. The overall migration problem in DOSM includes the difficulty of obtaining the final source data, which persists until the completion of the migration project.

3. Methodology

3.1 Definition of Data Migration

According to en.wikipedia.org, data migration is the process of transferring data between storage types, formats, or computer systems. Data migration is usually performed programmatically to achieve an automated migration, which frees human resources from tedious tasks.

It is also required when organizations or individuals change computer systems or upgrade to new systems, or when systems merge. In the DOSM environment, migration happens by transferring databases from the old silo systems to an integrated one.

For the purpose of this study, we considered the impact of data migration in the National Enterprise Wide Statistical System (NEWSS) on the Department of Statistics Malaysia.

3.2 Source of Data

The migration data are drawn from NEWSS Phase I and II, including data from the Economic Census 2005 and 2010.

3.3 Analysis

The processes that occur during migration are analysis, mapping, planning, designing, testing, loading and verifying. The analysis happens in the source system; after that, the data is extracted and transformed into the staging area. The staging area is a workspace where we clean the data, apply rules and validate it before loading it into the target.
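The role of the staging area described above can be illustrated with a minimal sketch. It assumes an in-memory SQLite database, and the table names (`staging`, `target`), the columns (`est_id`, `state`, `revenue`) and the validation rules are hypothetical choices made for illustration; they are not the NEWSS schema or the rules used in DOSM.

```python
import sqlite3


def is_valid(est_id, state, revenue):
    """Hypothetical rule: id and state present, revenue is a non-negative number."""
    if not est_id or not state:
        return False
    try:
        return float(revenue) >= 0
    except (TypeError, ValueError):
        return False


def migrate_via_staging(source_rows):
    """Land raw rows in staging, cleanse and validate there, then load the target."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE staging (est_id TEXT, state TEXT, revenue TEXT)")
    conn.execute("CREATE TABLE target (est_id TEXT, state TEXT, revenue REAL)")

    # Extract: raw rows are copied into staging unchanged
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", source_rows)

    # Cleanse inside staging: trim whitespace, normalise state names to upper case
    conn.execute("UPDATE staging SET est_id = TRIM(est_id), state = UPPER(TRIM(state))")

    # Validate: only rows that pass the rules reach the target; the rest are reported
    rejected = []
    staged = conn.execute("SELECT est_id, state, revenue FROM staging").fetchall()
    for est_id, state, revenue in staged:
        if is_valid(est_id, state, revenue):
            conn.execute(
                "INSERT INTO target VALUES (?, ?, ?)", (est_id, state, float(revenue))
            )
        else:
            rejected.append((est_id, state, revenue))

    loaded = conn.execute(
        "SELECT est_id, state, revenue FROM target ORDER BY est_id"
    ).fetchall()
    conn.close()
    return loaded, rejected
```

Keeping the rejected rows separate from the loaded ones mirrors the idea that problem data is corrected before it reaches the target rather than after.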

Figure 1.1: Data Migration Methodology

3.4 Data Migration Life Cycle

The data migration life cycle comprises six phases: analyze, map, high-level design, detailed design, construct, and test & deploy.

Figure 1.2: Data Migration Life Cycle

3.5 Data Migration Work Flow

Figure 1.3: Data Migration Work Flow

In the data migration process, a few steps need to be carried out, as below:

a) Prerequisite - All the requirements for migration need to be defined properly, and the field list needs to be prepared based on the NEWSS database format. All the questionnaire forms, sample data and explanations of requirements are needed for further processing.

b) Mapping - This phase identifies the database fields in the source file and maps them to the fields in the NEWSS databases. The mapping is important in order to identify which source field relates to which field in the NEWSS databases, to identify the related field changes from year to year in the questionnaire and in the source data, and to find information about changes to each field, such as a new field created for the current database, or fields that have been dropped, split or combined. All the information from this phase is very important for executing the next phase.

c) Designing - In this phase the migration script is developed based on the mapping gathered in step (b) and with reference to the questionnaire and field list from step (a). After the script has been designed, it is first tested with the sample data gathered in step (a). This testing needs to be performed to ensure that the script meets the requirements and runs without error.

d) Data Source - The data sources and their format are verified in this phase.

e) Cleansing and Loading Data - The data sources are verified with the migration script to check whether the source data is clean. If the data shows errors during verification, detailed checking and correction need to be done by SMD. Otherwise, if the verification succeeds, the data goes to final verification and is prepared for the next step.

f) Production Loaded - The data with final verification prepared in step (e) is loaded into the production database. After the data has been loaded, SMD needs to perform a few tests on the loaded data using the NEWSS system. After successful testing, SMD approves the migrated data.

3.6 Master Data Management

Master Data Management comprises four main activities, as below:

1. Identify data source - In NEWSS, data sources come from other statistical systems/OLTP and from various sources such as MS Excel, csv, MS Access, flat file, MySQL, MS SQL Server, etc.

2. Create data profiling - The process of examining the DOSM data, such as Economic, Population, Trade, External Trade and Labour Force, everything that is available in the existing data sources. The importance of profiling is:

a) To understand the data completely and fully

b) To improve data quality (by clarifying the structure, content and relationships)

c) To improve users' understanding of data anomalies (a basis for an early go/no-go decision)

d) To discover, register, and assess enterprise metadata (validating metadata when it is available and discovering metadata when it is not)

e) To improve data accuracy in corporate databases (which helps to assure that data cleaning and transformations have been done correctly according to requirements)

3. Check data quality - This process aims to discover what has been missed and when things go wrong, so that confident decisions can be made and the data is reliable for further data analysis and data analytics. For example, it checks that all relevant data, such as gender, address, postcode, district, state and date format, is present as required for a given respondent. Common data problems like misspellings, typing errors and random abbreviations are cleaned up.

4. Extract, transform and load - Extract refers to extracting data from multiple sources and formats (MS Excel, csv, MS Access, flat file, Oracle, MySQL, MS SQL Server, etc.) into a single standardized format. Transform involves data mapping, verification, code generation and data conversion. Load transfers the historical data into the production database or end target.

However, in the current NEWSS migration process, the data profiling and data quality activities are not carried out because of certain constraints. To substitute for these missing components, the data is checked during the extract, transform and load process, although this makes the migration take slightly longer to complete. In practice, during the ETL activities any data problem is passed back to SMD for checking, and SMD checks it and takes corresponding action. Once completed, SMD sends the data back to BPM. The new data needs to be validated again, and if there is still a data problem it is returned to SMD for correction. This cycle usually prolongs the ETL activities.
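The profiling and data quality activities (2 and 3 above) could look roughly like the following sketch. The field names (`gender`, `postcode`, `interview_date`), the gender coding, the five-digit postcode pattern and the date format are all assumptions made for illustration; the actual rules would come from the questionnaire and the NEWSS field list.

```python
import re
from collections import Counter
from datetime import datetime

VALID_GENDERS = {"M", "F"}          # assumed coding scheme
POSTCODE_RE = re.compile(r"\d{5}")  # assumes five-digit postcodes
DATE_FORMAT = "%Y-%m-%d"            # assumed target date format


def _text(record, field):
    """Return the field as stripped text; a missing or None value becomes ''."""
    value = record.get(field)
    return "" if value is None else str(value).strip()


def check_record(record):
    """Quality check: return the names of the fields that fail their rule."""
    problems = []
    if _text(record, "gender").upper() not in VALID_GENDERS:
        problems.append("gender")
    if not POSTCODE_RE.fullmatch(_text(record, "postcode")):
        problems.append("postcode")
    try:
        datetime.strptime(_text(record, "interview_date"), DATE_FORMAT)
    except ValueError:
        problems.append("interview_date")
    return problems


def profile_field(records, field):
    """Profiling: count missing values and the frequency of each distinct value."""
    values = [_text(r, field).upper() for r in records]
    missing = sum(1 for v in values if not v)
    return {"missing": missing, "values": Counter(v for v in values if v)}
```

Running `profile_field` over a source file before migration gives an early picture of missing values and unexpected codes, while `check_record` flags the individual records that would otherwise be sent back for correction during ETL.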

Figure 1.4: Extract, Transform and Load (ETL) diagram
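The mapping step (b) of the work flow and the extract, transform and load activity can be sketched together as below. This is only an illustration under stated assumptions: the source column names (`ESTAB_NO`, `NEGERI`, `HASIL`, `EST_ID`, `STATE`, `REVENUE`), the target table `establishment` and its fields are hypothetical, and the per-year mapping dictionary simply shows one way to record field changes between survey years.

```python
import csv
import io
import sqlite3

# Hypothetical mapping per survey year: source column -> target field
FIELD_MAPS = {
    2005: {"ESTAB_NO": "est_id", "NEGERI": "state", "HASIL": "revenue"},
    2010: {"EST_ID": "est_id", "STATE": "state", "REVENUE": "revenue"},
}


def extract(csv_text):
    """Extract: read source rows from CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows, year):
    """Transform: rename fields using the year's mapping and convert types."""
    mapping = FIELD_MAPS[year]
    records = []
    for row in rows:
        record = {target: row[source].strip() for source, target in mapping.items()}
        record["revenue"] = float(record["revenue"])
        record["ref_year"] = year
        records.append(record)
    return records


def load(conn, records):
    """Load: insert the standardized records into the target table."""
    conn.executemany(
        "INSERT INTO establishment (est_id, state, revenue, ref_year) "
        "VALUES (:est_id, :state, :revenue, :ref_year)",
        records,
    )
    conn.commit()
```

A run for two survey years with different source layouts might then look like `load(conn, transform(extract(text_2005), 2005))` followed by the same call for 2010, where `conn` is a `sqlite3` connection holding the `establishment` table. Keeping one mapping entry per year is one possible way to keep mapping versions explicit when the survey form changes.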

3.7 Migration Tools

In the market for migration tools, there are many software tools used for ETL purposes. All the available tools have their own strengths and weaknesses. Below are the tools that were explored during the migration process conducted in DOSM.

a) Talend

Really powerful, stable and customizable

Quite easily embeddable; it produces Java code

The drawback is the learning curve

b) Pentaho

Its ETL tool (named Kettle) is just one component of the Pentaho Business Intelligence open platform

It is Java-based

The major drawback is that Kettle is much harder to extend than Talend

c) CloverETL

Relatively young

It is light, easily embeddable and easy to learn

But it is much less powerful than Talend and even than Kettle

3.8 Challenges

Challenges that occur during the migration process in DOSM are:

a) Analyze - Difficulty in the data collection process because data comes from different sources, and misunderstanding of user requirements.

b) Mapping - Mapping versions are hard to control due to frequent changes to the survey form.

c) Design Migration Script - Usually a time constraint problem, because a large amount of data is requested to be migrated within a short time frame, sometimes as an ad hoc request from SMD.

d) Cleansing Data - It is hard to get the real final data because of several revision releases. Sometimes data cleansing needs to be confirmed several times with SMD, especially if there is a data problem during the ETL process.

e) Testing - Checking the data involves many reference tables, e.g. Establishment Frame, MSIC Code, Household Frame, locality, etc.

f) Data Loaded - Some data patching activities affect the new version of the final data.

4. Findings and Discussion

4.1 Way forward

Based on the migration activities done for the NEWSS system in DOSM, some findings and suggested enhancements are listed below.

CURRENT | FUTURE
No profiling activity | Establish profiling phase
No centralized final data | StatsDW
Patching | Less data patching
Repeated migration process causes the Final Data to vary | Final Data is fixed and cannot be altered
Hard to get clean data, which affects the DM process | Fix the data cleansing process at the beginning stage

Table 1: Findings during migration process

4.2 Contribution of Migration activities to the DOSM

Based on the migration process for NEWSS, the result is shown in Table 2.

4.3 Contribution of Migration activities to the NEWSS

From the paired t-test conducted, the null hypothesis was rejected (p-value = 6.0 x 10^-9). The mean value of the multiplier of unskilled labour in 2010 is 1.2897, which is smaller than the 1.4539 recorded in 2005. This reveals that the contribution of unskilled labour between 2005 and 2010 is negatively significant. This is once again in line with government policies to reduce dependency on unskilled labour. The result is shown in Table E, attached in the Appendix.
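For readers unfamiliar with the paired t-test mentioned above, the following sketch shows how the test statistic is computed from paired 2005 and 2010 values. The input in the usage note is only the first five sectors of Table A in the Appendix, chosen for illustration; it does not reproduce the reported p-value, which is based on all sectors. The function name is hypothetical.

```python
import math
import statistics


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired samples, using after - before."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Unskilled multipliers for sectors 1-5 of Table A (illustrative subset only)
unskilled_2005 = [1.2540, 1.3874, 1.2790, 1.4993, 1.2840]
unskilled_2010 = [1.0953, 1.1298, 1.2033, 1.1368, 1.3409]
t_value, dof = paired_t_statistic(unskilled_2005, unskilled_2010)
```

A negative `t_value` indicates that the 2010 values are lower on average than the 2005 values. Turning the statistic into a p-value requires the t-distribution with `dof` degrees of freedom, for which a statistics library such as SciPy (`scipy.stats.ttest_rel`) could be used instead of this hand-written version.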

4.4 Contribution of Migration activities to the division in DOSM

Further investigation needs to be conducted across sectors to assist policy makers in determining in which sector(s) of the economy to spend one additional unit. A comparison of output multipliers would show where spending would have the greatest impact on output or employment generated throughout the economy (Hussain, 2011).

Table 2 portrays the ten sectors with the highest multiplier increase of skilled labour between 2005 and 2010, led by Measuring, Checking & Industrial Process Equipment (0.448), followed by Other Mining and Quarrying (0.355) and Recycling (0.336).

4.5 Contribution of Migration activities to the IT Division

Table 2: Top Ten Sectors with Highest Multiplier Increase of Skilled Labour

SECTOR | SKILLED 2005 | SKILLED 2010 | DIFFERENCE | RANK
Measuring, Checking & Industrial Process Equipment | 0.29317 | 0.74101 | 0.44783 | 1
Other Mining and Quarrying | 0.10271 | 0.45795 | 0.35524 | 2
Recycling | 0.16663 | 0.50256 | 0.33593 | 3
Confectionery | 0.18408 | 0.51117 | 0.32709 | 4
Sheet Glass and Glass Products | 0.29967 | 0.61203 | 0.31236 | 5
Communication | 0.71280 | 0.98839 | 0.27559 | 6
Medical, Surgical and Orthopaedic Appliances | 0.40515 | 0.64597 | 0.24082 | 7
Forestry and Logging | 0.11221 | 0.34934 | 0.23712 | 8
Computer Services | 1.08238 | 1.29591 | 0.21353 | 9
Financial Institution | 0.84319 | 1.04517 | 0.20198 | 10

From Table 3, five sectors were identified as running contrary to the government's intention to increase the number of skilled workers in Malaysia's labour market. The multiplier effect of skilled labour is smaller in 2010 for these five sectors, which shows that the additional skilled employment generated in 2010 is smaller than in 2005. The largest multiplier decreases were recorded by Fertilizers (-0.524), Publishing (-0.251), Water Transport (-0.235), Air Transport (-0.234) and Other Private Services (-0.227).

Table 3: Bottom Five Sectors with Highest Multiplier Decrease of Skilled Labour

SECTOR | SKILLED 2005 | SKILLED 2010 | DIFFERENCE | RANK
Fertilizers | 0.88617 | 0.36188 | -0.52430 | 120
Publishing | 0.67956 | 0.42833 | -0.25123 | 119
Water Transport | 0.81389 | 0.57923 | -0.23466 | 118
Air Transport | 1.16909 | 0.93468 | -0.23441 | 117
Other Private Services | 0.41426 | 0.18688 | -0.22738 | 116

With the emergence of many new technologies in various sectors of the Malaysian economy, the dependency on unskilled labour is expected to decrease. However, from Table C, 32 sectors are found to be contrary to this expectation, led by the ten sectors presented in Table 4. The employment multiplier of unskilled labour in these sectors is greater in 2010 than in 2005. Hence, these sectors need to be given more consideration by the government when spending one additional unit of investment, because most of them are seen to require more skilled labour.

Table 4: Top Ten Sectors with Highest Multiplier Increase of Unskilled Labour

SECTOR | UNSKILLED 2005 | UNSKILLED 2010 | DIFFERENCE | RANK
Recycling | 0.97630 | 1.59857 | 0.62227 | 1
Forestry and Logging | 1.15466 | 1.75142 | 0.59676 | 2
Wooden and Cane Containers | 1.66741 | 2.22918 | 0.56177 | 3
Sawmilling and Planning of Wood | 1.62424 | 2.09614 | 0.47189 | 4
Business Services | 0.98516 | 1.41814 | 0.43299 | 5
Rubber Products | 1.71867 | 2.12325 | 0.40457 | 6
Veneer Sheets, Plywood, Laminated & Particle Board | 1.77653 | 2.15729 | 0.38076 | 7
Builders' Carpentry and Joinery | 1.79615 | 2.11297 | 0.31681 | 8
Other Chemicals Product | 1.17550 | 1.44179 | 0.26629 | 9
Preservation of Seafood | 1.96180 | 2.20841 | 0.24661 | 10

5. Concluding Remarks

The contribution of data migration to the operation of NEWSS has been demonstrated by the increasing demand from SMD for migrated data. Skilled labour is also crucial in generating gross domestic product (GDP), improving the image of the industry and creating a productive workforce (Mohamad Faiz, 2008). Shamley and Ishak (2011) concluded that skilled workers contribute positively to productivity in the manufacturing sectors.

In achieving the government's goal to become a competitive, developed and high-income nation by 2020, the government has targeted producing 50 per cent skilled labour in various fields. Currently the country has only 28 per cent skilled labour, and this is targeted to increase to 33 per cent next year.

Hence, this study was conducted with the aim of investigating whether the current labour market is in line with the government's intention. The employment multiplier was calculated using the Input-Output Model. In addition, the contribution of skilled and unskilled labour was investigated using a paired t-test. The results are presented in Tables 1-3. The analysis across sectors was conducted to identify in which sector(s) employment changes occurred (Tables 4-6, Tables A and B). The detailed information provided by the employment multiplier across sectors may serve as valuable input to national planners in determining which sectors need to be improved.

References

Chinkook, L. and Gerald, S. (1999). Effect of Trade on the Demand for Skilled and Unskilled Workers. Economic Systems Research, Vol. 11, No. 1.

http://www.theborneopost.com/2014/06/05/50-per-cent-skilled-workers-targeted-to-be-produced-by-2020/

http://www.investopedia.com/terms/s/skilled-labor.asp

http://www.investopedia.com/terms/u/unskilled-labor.asp

Hussain Ali Bekhet (2011). Output, Income and Employment Multiplier in Malaysian Economy: Input-Output Approach. International Business Research, Vol. 4, No. 1, January 2011.

Lowell, L. and Batalova, J. (2005). International Migration of Highly Skilled Workers: Methodological and Public Policy Issues. Population Association of America 2005 Annual Meeting Program.

Madeline B.D. (2010). Input-Output Multiplier Analysis for Major Industries in the Philippines. 11th National Convention on Statistics (NCS), EDSA Shangri-La Hotel, October 4-5, 2010.

MASCO (2010). Kementerian Sumber Manusia Malaysia.

Mohd Shalemy and Ishak Yussof (2011). High Skilled Workers and Productivity in Manufacturing Sector in Malaysia. Prosiding PERKEM VI, Jilid 2 (2011), 308-318.

Raouf, R. and Hafid, H. (2014). Relocation and Inequalities between Skilled and Unskilled in Northern Countries: Simulation Using a CGE Model. International Journal of Economics and Financial Issues, Vol. 4, No. 4, pp. 758-772.

Roberts, J. and Skoufias, E. (1997). The Long-Run Demand for Skilled and Unskilled Labour in Colombian Manufacturing Plants. The Review of Economics and Statistics, Vol. 79, No. 2 (May 1997), pp. 330-334.

Siti Rahmah and Nurul Naqiah (2013). Foreign Employment Multiplier in Malaysia: An Input-Output Analysis. Presented in Technical Paper Presentation 2013.

Appendix

Table A: Total Employment Multiplier

SECTOR | DESCRIPTION | UNSKILLED 2005 | SKILLED 2005 | UNSKILLED 2010 | SKILLED 2010
1 | Paddy | 1.2540 | 0.1411 | 1.0953 | 0.0750
2 | Food Crops | 1.3874 | 0.1103 | 1.1298 | 0.1061
3 | Vegetables | 1.2790 | 0.1352 | 1.2033 | 0.1057
4 | Fruits | 1.4993 | 0.1501 | 1.1368 | 0.1018
5 | Rubber | 1.2840 | 0.1135 | 1.3409 | 0.2722
6 | Oil Palm | 1.3644 | 0.1806 | 1.1720 | 0.1632
7 | Flower Plants | 1.7222 | 0.2440 | 1.1383 | 0.1033
8 | Other Agriculture | 1.5405 | 0.1698 | 1.0837 | 0.1162
9 | Poultry Farming | 1.7204 | 0.2957 | 1.5253 | 0.2580
10 | Other Livestock | 1.5852 | 0.2139 | 1.4931 | 0.2887
11 | Forestry and Logging | 1.1547 | 0.1122 | 1.7514 | 0.3493
12 | Fishing | 1.5551 | 0.1264 | 1.5381 | 0.3065
13 | Crude Oil and Natural Gas | 0.6459 | 0.6614 | 0.4885 | 0.6990
14 | Metal Ore Mining | 1.8025 | 0.5852 | 0.9398 | 0.4113
15 | Stone Clay and Sand Quarrying | 1.6960 | 0.3903 | 1.1200 | 0.1649
16 | Other Mining and Quarrying | 0.9485 | 0.1027 | 1.0553 | 0.4580
17 | Meat and Meat Production | 2.3614 | 0.2774 | 1.8916 | 0.4132
18 | Preservation of Seafood | 1.9618 | 0.2189 | 2.2084 | 0.3500
19 | Preservation of Fruits and Vegetables | 1.7055 | 0.2993 | 1.5840 | 0.3437
20 | Dairy Production | 1.7429 | 0.3560 | 1.5831 | 0.4450
21 | Oils and Fats | 2.4162 | 0.3372 | 2.2614 | 0.3983
22 | Grain Mills | 1.5473 | 0.3093 | 1.5576 | 0.2303
23 | Bakery Products | 1.9725 | 0.3432 | 1.5538 | 0.3059
24 | Confectionery | 1.4408 | 0.1841 | 1.0232 | 0.5112
25 | Other Food Processing | 1.8880 | 0.3311 | 1.3778 | 0.3827
26 | Animal Feeds | 1.3848 | 0.4221 | 1.4322 | 0.4257
27 | Wine and Spirit | 1.5638 | 0.1998 | 1.1577 | 0.3179
28 | Soft Drink | 1.6551 | 0.3963 | 1.4586 | 0.4141
29 | Tobacco Products | 1.6868 | 0.3074 | 0.8491 | 0.4958
30 | Yarn and Cloth | 1.4124 | 0.4253 | 1.4117 | 0.4538
31 | Finishing of Textiles | 2.1364 | 0.3862 | 1.5177 | 0.4488
32 | Other Textiles | 1.7151 | 0.3636 | 1.3360 | 0.3228
33 | Wearing Apparel | 1.4665 | 0.2498 | 1.3262 | 0.2610
34 | Leather Industries | 1.6718 | 0.1378 | 1.4726 | 0.2935
35 | Footwear | 1.6293 | 0.2457 | 1.4782 | 0.2645
36 | Sawmilling and Planning of Wood | 1.6242 | 0.1823 | 2.0961 | 0.3414
37 | Veneer Sheets, Plywood, Laminated & Particle Board | 1.7765 | 0.2451 | 2.1573 | 0.3553
38 | Builders' Carpentry and Joinery | 1.7962 | 0.2553 | 2.1130 | 0.3912
39 | Wooden and Cane Containers | 1.6674 | 0.1800 | 2.2292 | 0.3707
40 | Other Wood Products | 1.7080 | 0.2299 | 1.8879 | 0.3955
