
Migration to Azure Synapse Analytics

Section 2.1 – Data migration, ETL and load for Teradata migrations


Table of contents

Context
Overview
Data migration considerations
    Initial decisions regarding data migration from Teradata
        Migrate unused tables?
        What's the best migration approach to minimize risk and impact on users?
        Using a VM Teradata instance as part of a migration
        Migrating data marts – stay physical or go virtual?
    Data migration from Teradata
        Understand your data
        Teradata data type mapping
ETL migration considerations
    Initial decisions regarding Teradata ETL migration
    Re-engineering existing Teradata-specific scripts
    Using existing 3rd party ETL tools
        Ab Initio
        Attunity
        Informatica
        Pentaho
        Talend
        WhereScape
Data loading from Teradata
    Choices available when loading data from Teradata
        Transfer data via files or network connection?
        Orchestrate from Teradata or Azure?
        Which tools can be used?
Summary


Context

This paper is one of a series of documents discussing aspects of migrating legacy data warehouse implementations to Azure Synapse. The focus of this paper is on data migration, ETL and loading specifically from existing Teradata environments; other topics, such as the recommended migration approach and advanced analytics in the data warehouse, are covered in separate documents. This document should be read in conjunction with the 'Section 2 – Data ETL and Load' document, which discusses the general aspects of design and performance for migrations to Azure Synapse.


Overview

Existing users of Teradata data warehouse systems are now looking to take advantage of the innovations provided by newer environments (e.g. cloud, IaaS, PaaS) and to delegate tasks such as infrastructure maintenance and platform development to the cloud provider.

While there are similarities between Teradata and Azure Synapse in that both are SQL databases designed to use massively parallel processing (MPP) techniques to achieve high query performance on very large data volumes, there are also some basic differences in approach:

• Legacy Teradata systems are usually installed on-premise, using proprietary hardware, whereas Azure Synapse is cloud-based, using Azure storage and compute resources.

• Upgrading a Teradata configuration is a major task involving additional physical hardware and a potentially lengthy database reconfiguration. Since storage and compute resources are separate in the Azure environment, they can easily be scaled up or down independently, leveraging the elastic scalability capability.

• Azure Synapse can be paused or resized as required to reduce resource utilization and therefore cost (see the example after this list).
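For illustration, resizing can be done with a single T-SQL statement run against the master database; the pool name and service objective below are placeholders:

-- Scale compute up or down independently of storage (illustrative names)
ALTER DATABASE MySynapsePool MODIFY (SERVICE_OBJECTIVE = 'DW1000c');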

Microsoft Azure is a globally available, highly secure, scalable cloud environment which includes Azure Synapse within an ecosystem of supporting tools and capabilities.

Azure Synapse provides best-of-breed relational database performance by using techniques such as massively parallel processing (MPP) and automatic in-memory caching. The results of this approach can be seen in independent benchmarks such as the one run recently by GigaOm (see https://gigaom.com/report/data-warehouse-cloud-benchmark/), which compares Azure Synapse to other popular cloud data warehouse offerings. Customers who have already migrated to this environment have seen many benefits, including:


• Improved performance and price/performance

• Increased agility and shorter time to value

• Faster server deployment and application development

• Elastic scalability – only pay for actual usage

• Improved security/compliance

• Reduced storage and Disaster Recovery costs

• Lower overall TCO and better cost control (OPEX)

To maximize these benefits it is necessary to migrate existing (or new) data and applications to the Azure Synapse platform, and in many organizations this will include migration of an existing data warehouse from legacy on-premise platforms such as Teradata or Netezza. At a high level, the basic process will include the following steps:

[Figure: high-level migration process steps]

This paper looks at the data migration, ETL and loading aspects of migration from a legacy Teradata data warehouse and data marts onto Azure Synapse. The topics included in this paper apply specifically to migrations from an existing Teradata environment.


Data migration considerations

Initial decisions regarding data migration from Teradata

When it comes to migrating a Teradata data warehouse, there are a few basic questions associated with data that need to be asked. For example:

• Should unused table structures be migrated or not?

• What’s the best migration approach to minimize risk and impact for users?

• Migrating data marts – stay physical or go virtual?

The next sections discuss these points within the context of a migration from Teradata.

Migrate unused tables?

It generally makes sense to migrate only the tables that are actually in use in the existing system. Tables which are not active can be archived rather than migrated, so that the data is available if required in future. It's best to use system metadata and logfiles rather than documentation to determine which tables are in use, as documentation may be out of date. In legacy systems it is not unusual for tables to become redundant over time; these don't need to be migrated in most cases.

Teradata system catalog tables and logs contain information which can be used to determine when a given table was last accessed, which in turn can be used to decide whether or not a table is a candidate for migration.

A simple query on DBC.Tables can provide the date of last access and last modification, for example:

SELECT TableName,
       CreatorName,
       CreateTimeStamp,
       LastAlterName,
       LastAlterTimeStamp,
       AccessCount,
       LastAccessTimeStamp
FROM DBC.Tables
WHERE DataBaseName = 'databasename';

If logging has been enabled and the log history is accessible, much more information, including SQL query text, is available in the table DBQLogTbl and the associated logging tables. See https://docs.teradata.com/reader/wada1XMYPkZVTqPKz2CNaw/PuQUxpyeCx4jvP8XCiEeGA for more details.
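As an illustration, where object-level query logging (BEGIN QUERY LOGGING ... WITH OBJECTS) has been enabled and the history retained, a query along the following lines can report the most recent access per table; verify the DBQL table and column names against your Teradata release:

-- Sketch: most recent access per table from DBQL object logging
SELECT ObjectDatabaseName,
       ObjectTableName,
       MAX(CollectTimeStamp) AS LastAccess
FROM DBC.DBQLObjTbl
WHERE ObjectDatabaseName = 'databasename'
GROUP BY ObjectDatabaseName, ObjectTableName
ORDER BY LastAccess DESC;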


What's the best migration approach to minimize risk and impact on users?

This question comes up a lot, as companies often want to lower the impact of changes to the data warehouse data model to improve agility, and would like to seize the opportunity of a migration to modernize their data model. This carries higher risk, as it would almost certainly impact both the ETL jobs that populate the data warehouse and those that feed dependent data marts from it. Therefore it may be better to do a re-design of this scale after the data warehouse migration.

Even if a data model change is to be part of the overall migration, it is good practice to migrate the existing model 'as-is' to the new environment (Azure Synapse in this case), then do any re-engineering on the new platform. This approach minimizes the impact on existing production systems while leveraging the performance and elastic scalability of the Azure platform for one-off re-engineering tasks.

When migrating from Teradata, there is also the option of creating a Teradata environment in a VM within Azure as a 'stepping stone' in the migration process.

Using a VM Teradata instance as part of a migration

One optional approach for running a migration from an on-premise Teradata environment is to leverage the cheap cloud storage and elastic scalability of Azure to create a Teradata instance within a VM in Azure, co-located with the target Azure Synapse environment.

With this approach, standard Teradata utilities such as Teradata Parallel Transporter (or 3rd party data replication tools such as Attunity Replicate) can be used to efficiently move the subset of Teradata tables to be migrated onto the VM instance, and then all migration tasks can take place within the Azure environment. This approach has several benefits:

• After the initial replication of data, the source system is not impacted by the migration tasks

• The familiar Teradata interfaces, tools and utilities are available within the Azure environment

• Once in the Azure environment, there are no potential issues with network bandwidth availability between the on-premise source system and the cloud target system

• Tools such as Azure Data Factory can efficiently call utilities such as Teradata Parallel Transporter to migrate data quickly and easily

• The migration process is orchestrated and controlled entirely within the Azure environment

Migrating data marts – stay physical or go virtual?

In legacy Teradata data warehouse environments it is common practice to create a number of data marts structured to provide good performance for ad hoc self-service queries and reports for a given department or business function within an organization. As such, a data mart typically consists of a subset of the data warehouse containing aggregated versions of the data in a form that enables users to easily query that data with fast response times via user-friendly query tools such as Microsoft Power BI, Tableau or MicroStrategy. This form is generally a dimensional data model, and one use of data marts is to expose the data in a usable form even if the underlying warehouse data model is something different (e.g. data vault).

Separate data marts for individual business units within an organization can also be used to implement robust data security regimes: user access can be restricted to the specific data marts relevant to each unit, and sensitive data can be eliminated, obfuscated or anonymized.

If these data marts are implemented as physical tables, they require additional storage resources and additional processing to build and refresh them on a regular basis. It also means that the data in the mart is only as up to date as the last refresh operation, so it may not be suitable for highly volatile data dashboards.

With the advent of relatively low-cost, scalable MPP architectures such as Azure Synapse, and their inherent performance characteristics, data mart functionality can often be provided without instantiating the mart as a set of physical tables. This is achieved by effectively virtualizing the data marts via SQL views onto the main data warehouse, or via a virtualization layer using features such as views in Azure or 3rd party virtualization products such as Denodo. This approach simplifies or eliminates the need for additional storage and aggregation processing, and reduces the overall number of database objects to be migrated.

There is another potential benefit of this approach: by implementing the aggregation and join logic within a virtualization layer, and presenting external reporting tools with a virtualized view, the processing required to create these views is 'pushed down' into the data warehouse, which is generally the best place to run joins, aggregations and similar operations on large data volumes.
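As a minimal sketch, a virtualized mart can be as simple as an aggregating view over the warehouse tables; the schema, table and column names below are illustrative:

-- Virtual data mart: an aggregating view whose processing is pushed
-- down into the warehouse (illustrative names)
CREATE VIEW mart_sales.v_monthly_revenue AS
SELECT d.calendar_month,
       s.region,
       SUM(s.amount) AS total_revenue
FROM dw.fact_sales s
JOIN dw.dim_date d
    ON s.date_key = d.date_key
GROUP BY d.calendar_month, s.region;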

The primary drivers for choosing a virtual rather than a physical data mart implementation are:

• More agility, as a virtual data mart is easier to change than physical tables and the associated ETL processes

• Lower total cost of ownership, because a virtualized implementation has fewer data stores and copies of data

• Elimination of ETL jobs to migrate, and a simplified data warehouse architecture, in a virtualized environment

• Performance: although physical data marts have historically been more performant, virtualization products are now implementing intelligent caching techniques to mitigate this


Data migration from Teradata

Understand your data

Part of the migration planning should be to understand in detail the volume of data to be migrated, as this can impact decisions on the migration approach. Use system metadata to determine the physical space taken up by the 'raw data' within the tables to be migrated. In this context, 'raw data' means the amount of space used by the data rows within a table, excluding overheads such as indexes and any compression. This is especially true for the largest fact tables, as these will typically comprise more than 95% of the data.

A good way to get an accurate number for a given table is to extract a representative sample of the data (e.g. one million rows) to an uncompressed, delimited flat ASCII data file, and use the size of that file to derive an average raw data size per row. Multiply this average by the total number of rows in the full table to give a raw data size for that table, and use this figure in planning.
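As an illustrative calculation (all numbers and names below are hypothetical): if a one-million-row sample extracted with Teradata's SAMPLE clause comes to roughly 212 MB uncompressed, the average is about 212 bytes per row, and the estimate scales up with the full row count:

-- Extract a sample to measure average bytes per row (export via BTEQ/TPT)
-- SELECT * FROM ProdDB.fact_sales SAMPLE 1000000;

-- Scale the measured average (212 bytes/row here) up to the full table
SELECT COUNT(*)                                AS TotalRows,
       COUNT(*) * 212                          AS EstRawBytes,
       COUNT(*) * 212 / (1024.0 * 1024 * 1024) AS EstRawGB
FROM ProdDB.fact_sales;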

Teradata data type mapping

Some Teradata data types are not directly supported in Azure Synapse. The table below shows these data types together with the recommended approach for handling them; assess the impact of unsupported data types as part of the migration preparation phase. In the table, the Teradata column type is the abbreviation stored within the system catalog (e.g. in DBC.ColumnsV).

Teradata Column Type | Teradata Data Type | Azure Synapse Data Type
++ | TD_ANYTYPE | Not supported in Azure Synapse
A1 | ARRAY | Not supported in Azure Synapse
AN | ARRAY | Not supported in Azure Synapse
AT | TIME | TIME
BF | BYTE | BINARY
BO | BLOB | Not directly supported; can be replaced with BINARY
BV | VARBYTE | BINARY
CF | CHAR | VARCHAR
CO | CLOB | Not directly supported; can be replaced with VARCHAR
CV | VARCHAR | VARCHAR
D | DECIMAL | DECIMAL
DA | DATE | DATE
DH | INTERVAL DAY TO HOUR | Not supported; use DATEDIFF/DATEADD for date calculations
DM | INTERVAL DAY TO MINUTE | Not supported; use DATEDIFF/DATEADD for date calculations
DS | INTERVAL DAY TO SECOND | Not supported; use DATEDIFF/DATEADD for date calculations
DT | DATASET | Not supported in Azure Synapse
DY | INTERVAL DAY | Not supported; use DATEDIFF/DATEADD for date calculations
F | FLOAT | FLOAT
HM | INTERVAL HOUR TO MINUTE | Not supported; use DATEDIFF/DATEADD for date calculations
HR | INTERVAL HOUR | Not supported; use DATEDIFF/DATEADD for date calculations
HS | INTERVAL HOUR TO SECOND | Not supported; use DATEDIFF/DATEADD for date calculations
I1 | BYTEINT | TINYINT
I2 | SMALLINT | SMALLINT
I8 | BIGINT | BIGINT
I | INTEGER | INT
JN | JSON | Not directly supported; JSON data can be stored in a VARCHAR column
MI | INTERVAL MINUTE | Not supported; use DATEDIFF/DATEADD for date calculations
MO | INTERVAL MONTH | Not supported; use DATEDIFF/DATEADD for date calculations
MS | INTERVAL MINUTE TO SECOND | Not supported; use DATEDIFF/DATEADD for date calculations
N | NUMBER | NUMERIC
PD | PERIOD(DATE) | Can be converted to VARCHAR or split into two separate dates
PM | PERIOD(TIMESTAMP WITH TIME ZONE) | Can be converted to VARCHAR or split into two separate timestamps (DATETIMEOFFSET)
PS | PERIOD(TIMESTAMP) | Can be converted to VARCHAR or split into two separate timestamps (DATETIMEOFFSET)
PT | PERIOD(TIME) | Can be converted to VARCHAR or split into two separate times
PZ | PERIOD(TIME WITH TIME ZONE) | Can be converted to VARCHAR or split into two separate times, but WITH TIME ZONE isn't supported for TIME
SC | INTERVAL SECOND | Not supported; use DATEDIFF/DATEADD for date calculations
SZ | TIMESTAMP WITH TIME ZONE | DATETIMEOFFSET
TS | TIMESTAMP | DATETIME or DATETIME2
TZ | TIME WITH TIME ZONE | Not supported; TIME is stored as "wall clock" time only, without a time zone offset
XM | XML | Not directly supported; XML data can be stored in a VARCHAR column
YM | INTERVAL YEAR TO MONTH | Not supported; use DATEDIFF/DATEADD for date calculations
YR | INTERVAL YEAR | Not supported; use DATEDIFF/DATEADD for date calculations
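For example, Teradata INTERVAL arithmetic can usually be re-expressed with T-SQL date functions; a hypothetical before-and-after sketch (table and column names are illustrative):

-- Teradata: SELECT order_date + INTERVAL '30' DAY,
--                  (ship_ts - order_ts) DAY(4) TO SECOND ...
-- Azure Synapse T-SQL equivalent:
SELECT DATEADD(day, 30, order_date)        AS due_date,
       DATEDIFF(second, order_ts, ship_ts) AS ship_latency_seconds
FROM dbo.orders;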

Use the metadata from the Teradata catalog tables to determine whether any of these data types are to be migrated, and allow for this in the migration plan. For example, a SQL query such as the one below can be used to find any occurrences of unsupported data types which need attention:

SELECT ColumnType,
       CASE
           WHEN ColumnType = '++' THEN 'TD_ANYTYPE'
           WHEN ColumnType = 'A1' THEN 'ARRAY'
           WHEN ColumnType = 'AN' THEN 'ARRAY'
           WHEN ColumnType = 'BO' THEN 'BLOB'
           WHEN ColumnType = 'CO' THEN 'CLOB'
           WHEN ColumnType = 'DH' THEN 'INTERVAL DAY TO HOUR'
           WHEN ColumnType = 'DM' THEN 'INTERVAL DAY TO MINUTE'
           WHEN ColumnType = 'DS' THEN 'INTERVAL DAY TO SECOND'
           WHEN ColumnType = 'DT' THEN 'DATASET'
           WHEN ColumnType = 'DY' THEN 'INTERVAL DAY'
           WHEN ColumnType = 'HM' THEN 'INTERVAL HOUR TO MINUTE'
           WHEN ColumnType = 'HR' THEN 'INTERVAL HOUR'
           WHEN ColumnType = 'HS' THEN 'INTERVAL HOUR TO SECOND'
           WHEN ColumnType = 'JN' THEN 'JSON'
           WHEN ColumnType = 'MI' THEN 'INTERVAL MINUTE'
           WHEN ColumnType = 'MO' THEN 'INTERVAL MONTH'
           WHEN ColumnType = 'MS' THEN 'INTERVAL MINUTE TO SECOND'
           WHEN ColumnType = 'PD' THEN 'PERIOD(DATE)'
           WHEN ColumnType = 'PM' THEN 'PERIOD(TIMESTAMP WITH TIME ZONE)'
           WHEN ColumnType = 'PS' THEN 'PERIOD(TIMESTAMP)'
           WHEN ColumnType = 'PT' THEN 'PERIOD(TIME)'
           WHEN ColumnType = 'PZ' THEN 'PERIOD(TIME WITH TIME ZONE)'
           WHEN ColumnType = 'SC' THEN 'INTERVAL SECOND'
           WHEN ColumnType = 'SZ' THEN 'TIMESTAMP WITH TIME ZONE'
           WHEN ColumnType = 'XM' THEN 'XML'
           WHEN ColumnType = 'YM' THEN 'INTERVAL YEAR TO MONTH'
           WHEN ColumnType = 'YR' THEN 'INTERVAL YEAR'
       END AS Data_Type,
       COUNT(*) AS Data_Type_Count
FROM DBC.ColumnsV
WHERE DatabaseName IN ('UserDB1', 'UserDB2', 'UserDB3')  -- databases to be migrated
GROUP BY 1, 2
ORDER BY 1;

There are 3rd party vendors who offer tools and services to automate migration, including the mapping of data types described above. Also, if a 3rd party ETL tool such as Informatica or Talend is already in use in the Teradata environment, that tool can implement any required data transformations. The next section explores migration of existing 3rd party ETL processes.


ETL migration considerations

Initial decisions regarding Teradata ETL migration

For ETL/ELT processing, legacy Teradata data warehouses may use custom-built scripts based on Teradata utilities such as BTEQ and Teradata Parallel Transporter (TPT), or a 3rd party ETL tool such as Informatica or Ab Initio. Sometimes a combination of both approaches has evolved over time. When planning a migration to Azure Synapse, the question is how best to implement the required ETL/ELT processing in the new environment while minimizing the cost and risk involved.

The sections below discuss the options available and make some recommendations for the various use cases. One way to decide on the approach can be summarized by the flowchart below:

[Figure: ETL migration decision flowchart]

The initial step should always be to build an inventory of ETL/ELT processes to be migrated, including scripts and stored procedures. As with other steps, it may be that standard 'built-in' Azure features mean that some existing processes need not be migrated. For planning purposes it is important to understand the scale of the migration to be performed.

In the flowchart above, decision 1 is the high-level question of whether there has already been a decision to move to a totally Azure-native environment. If so, the recommendation is to re-engineer the ETL processing using Azure Data Factory (ADF) and associated utilities.

If that is not the case, decision 2 is whether or not an existing 3rd party ETL tool is already in use. In the Teradata environment, some (or all) of the ETL processing may be performed by custom scripts using Teradata-specific utilities such as BTEQ and TPT; the approach in this case is again to re-engineer using ADF.

If a 3rd party ETL tool such as Informatica or Ab Initio is already in use (and especially if there is a large investment in skills, and a large number of existing workflows and schedules built with that tool), then decision 3 is whether the tool can efficiently support Azure Synapse as a target environment. Ideally the tool will include 'native' connectors which can leverage Azure facilities such as PolyBase for the most efficient parallel data loading, but even if these are not in place there is generally a way of calling an external process (e.g. PolyBase) and passing the appropriate parameters. In this case the existing skills and workflows can be leveraged, with Azure Synapse becoming the new target environment.

If an existing 3rd party ETL tool is retained, there may be benefits to running that tool within the Azure environment (rather than on an existing on-premise ETL server), and to letting Azure Data Factory handle the overall orchestration of the existing workflows. So decision 4 is whether to leave the existing tool running 'as-is' or to move it into the Azure environment to gain cost, performance and scalability benefits.

Re-engineering existing Teradata-specific scripts

If some or all of the existing Teradata warehouse ETL/ELT processing is handled by custom scripts which use Teradata-specific utilities such as BTEQ, MLOAD or TPT, these need to be re-coded for the new Azure Synapse environment. Similarly, if ETL processes have been implemented using stored procedures in Teradata, these will also have to be re-coded.

Some elements of the ETL process are relatively easy to migrate, e.g. simple bulk data loads into a staging table from an external file. It may even be possible to automate these parts of the process, for example by using PolyBase instead of FastLoad or MLOAD, as sketched below. Other parts of the process which contain arbitrarily complex SQL and/or stored procedures will take more time to re-engineer.
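A minimal PolyBase load sketch follows; it assumes an external data source and file format have already been defined, and all object names are illustrative:

-- External table over CSV files landed in Azure Blob Storage
CREATE EXTERNAL TABLE ext.stg_sales (
    sale_id   BIGINT,
    sale_date DATE,
    amount    DECIMAL(18,2)
)
WITH (
    LOCATION = '/landing/sales/',
    DATA_SOURCE = AzureBlobStore,   -- assumed pre-created
    FILE_FORMAT = CsvFileFormat     -- assumed pre-created
);

-- Parallel load into a staging table via CTAS
CREATE TABLE dbo.stg_sales
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP)
AS SELECT * FROM ext.stg_sales;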

One way of testing Teradata SQL for compatibility with Azure Synapse is to capture some representative SQL statements from the Teradata logs, prefix those queries with EXPLAIN, and then (assuming a like-for-like migrated data model in Azure Synapse) run those EXPLAIN statements in Azure Synapse. Any incompatible SQL will return an error, and this information can be used to determine the scale of the re-coding task.
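For instance (with illustrative table names), a captured query is simply prefixed and re-run; an error rather than a query plan flags a statement needing rework:

-- Compatibility smoke test in Azure Synapse: returns a plan if the SQL
-- is supported, an error otherwise
EXPLAIN
SELECT c.region, SUM(s.amount) AS total_amount
FROM dbo.fact_sales s
JOIN dbo.dim_customer c ON s.cust_id = c.cust_id
GROUP BY c.region;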

In the worst case this will mean manual re-coding, but there are also products and services available from Microsoft partners to assist with the process. For example, Datometry Hyper-Q (see https://datometry.com/) runs between the legacy applications and Azure Synapse to translate Teradata queries into Azure Synapse T-SQL at the network layer.


Ispirer also offers tools and services to migrate Teradata SQL, stored procedures and related objects to Azure Synapse – see https://www.ispirer.com/products/teradata-to-azure-sql-data-warehouse-migration

Using existing 3rd party ETL tools

As described in the section above, in many cases the existing legacy data warehouse system will already be populated and maintained by a 3rd party ETL product such as Informatica or Talend. See https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-partner-data-integration for a list of current Microsoft data integration partners for Azure Synapse.

There are several popular ETL products frequently used in the Teradata community (some of which are already Microsoft partners listed at the link above). The following paragraphs discuss the most popular ETL tools currently in use with Teradata warehouses. All of these products can run within a VM in Azure and can read and write Azure databases and files.

Ab Initio

Ab Initio is a business intelligence platform comprising six data processing products: Co>Operating System, The Component Library, Graphical Development Environment, Enterprise Meta>Environment, Data Profiler, and Conduct>It. It is a powerful GUI-based parallel processing tool for ETL data management and analysis.

Ab Initio is popular in Teradata environments because it can parse Teradata BTEQ scripts automatically when they are executed using the 'Execute SQL' component, which allows Ab Initio to serve as a primary execution engine for an 'extract, transform, load and transform' (ETLT) approach.

Ab Initio can run in a VM within Azure and can read and write Azure Storage files and SQL Server databases via the 'Input Table', 'Output Table' and 'Run SQL' Ab Initio components.


Attunity

Attunity CloudBeam for Azure Synapse enables automated and optimized data loading from many enterprise databases into Azure Synapse – quickly, easily and affordably. It is available in the Microsoft Azure Marketplace – see https://www.attunity.com/products/cloudbeam/attunity-cloudbeam-azure/ for more details.

Attunity Replicate for Microsoft Migrations is for Microsoft customers who want to migrate data from popular commercial and open-source databases to the Microsoft Data Platform, including from Teradata to Azure Synapse. It can be obtained from https://www.attunity.com/products/replicate/attunity-replicate-for-microsoft-migration/.

Attunity benefits for Azure Synapse include:

• Continuous database to Azure Synapse loading

• Quick transfer speeds with guaranteed delivery

• Intuitive administration and scheduling

• Data integrity assurance by way of check mechanisms

• Monitoring for peace-of-mind, control, and auditing

• Industry-standard SSL encryption for security

Informatica

Informatica (see https://www.informatica.com/gb/) has two offerings which are available in the Azure Marketplace:

Informatica Cloud Services for Azure offers a best-in-class solution for self-service data migration, integration and management capabilities. Customers can quickly and reliably import and export petabytes of data to Azure from a variety of sources. Informatica Cloud Services for Azure provides native, high-volume, high-performance connectivity to Azure Synapse, SQL Database, Blob Storage, Data Lake Store, and Azure Cosmos DB.

Informatica PowerCenter is a metadata-driven data integration platform that jumpstarts and accelerates data integration projects in order to deliver data to the business more quickly than manual hand coding. It serves as the foundation for your data integration investments.

Pentaho

Pentaho is a business intelligence (BI) platform that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform and load (ETL) capabilities.

Pentaho Data Integration (PDI) provides ETL capabilities that facilitate the process of capturing, cleansing and storing data in a uniform and consistent format that is accessible and relevant to end users and IoT technologies.


Common uses of Pentaho Data Integration include:

• Data migration between different databases and applications

• Loading huge data sets into databases, taking full advantage of cloud, clustered and massively parallel processing environments

• Data cleansing, with steps ranging from very simple to very complex transformations

• Data integration, including the ability to leverage real-time ETL as a data source for Pentaho Reporting

• Data warehouse population, with built-in support for slowly changing dimensions and surrogate key creation

It is available in the Azure Marketplace and has connectors for Azure services such as HDInsight. See https://www.ashnik.com/pentaho-cloud-deployment-with-microsoft-azure/ for more details.

Talend

Talend Cloud is a unified, comprehensive and highly scalable integration platform-as-a-service (iPaaS) that makes it easy to collect, govern, transform and share data. Within a single interface, you can use Big Data integration, Data Preparation, API Services and Data Stewardship applications to provide trusted, governed data across your organization. It offers over 900 connectors and components, built-in data quality, native support for the latest big data and cloud technologies, and software development lifecycle (SDLC) support for enterprises, at a predictable price.

With just a few clicks, you can deploy the remote engine to run integration tasks natively with your Azure account, whether cloud to cloud, on-premises to cloud or cloud to on-premises, completely within the customer's environment for enhanced performance and security. See https://www.talend.com/solutions/information-technology/azure-cloud-integration/ for more information.

Talend can leverage Azure capabilities such as PolyBase for highly efficient data loading into Azure Synapse. See https://www.talend.com/blog/2017/02/08/leverage-load-data-microsoft-azure-sql-data-warehouse-using-polybase-talend-etl/ for details.

WhereScape

WhereScape® RED automation software is an integrated development environment that gives teams the automation to streamline workflows, eliminate hand-coding and cut the time to develop, deploy and operate data infrastructure, such as data warehouses, data vaults, data marts and data lakes, by as much as 80%.

WhereScape automation is tailored for use with Microsoft SQL Server, Microsoft Azure SQL Database, Microsoft Azure Synapse and Microsoft Analytics Platform System (PDW). See https://www.wherescape.com for full details.


Data loading from Teradata

Choices available when loading data from Teradata

When it comes to moving the data from a Teradata data warehouse, there are a few basic questions associated with data loading that need to be resolved. These involve deciding how the data will be physically moved from the existing on-premise Teradata environment into the new Azure Synapse environment in the cloud, and which tools will be used to perform the transfer and load.

• Will the data be extracted to files or moved directly via network?

• Will the process be orchestrated from the source system or from the Azure

target environment?

• Which tools can be used to automate and manage the process?

Transfer data via files or network connection?

Once the database tables to be migrated have been created in Azure Synapse, the data to populate those tables must be moved out of the legacy Teradata system and loaded into the new environment. There are two basic approaches:

• File Extract – the data from the Teradata tables is extracted to flat files (normally in 'Comma Separated Values' (CSV) format) via BTEQ, FastExport or Teradata Parallel Transporter (TPT). TPT should be used where possible as it is the most efficient in terms of data throughput (see the BTEQ sketch after this list for a minimal example).

This approach requires space to 'land' the extracted data files. This space could be local to the Teradata source database (if sufficient storage is available) or remote, in Azure Blob Storage. The best performance is generally achieved when the file is written locally, avoiding any network overhead. To minimize the storage and network transfer requirements, it is good practice to compress the extracted data files using a utility such as gzip.

Once extracted, the flat files can either be moved into Azure Blob Storage (co-located with the target Azure Synapse instance) or loaded directly into Azure Synapse via PolyBase. The method of physically moving the data from local on-premise storage to the Azure cloud environment depends on the amount of data to be moved and the network bandwidth available.

Microsoft provides various options to move large volumes of data, including AzCopy (for moving files across the network into Azure Storage), Azure ExpressRoute (for moving bulk data over a private network connection), and Azure Data Box (where the files are moved to a physical storage device which is then shipped to an Azure data center for loading). See https://docs.microsoft.com/en-us/azure/architecture/data-guide/scenarios/data-transfer for more details.

• Direct extract and load across the network – the target Azure environment sends a data extract request (normally via a SQL command) to the legacy Teradata system, and the results are sent across the network and loaded directly into Azure Synapse, with no need to 'land' the data in intermediate files. The limiting factor in this scenario is normally the bandwidth of the network connection between the Teradata database and the Azure environment; for very large data volumes this approach may not be practical.
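For the file-extract route, a minimal BTEQ export sketch follows; the logon, file path, table and column names are all placeholders, and TPT would normally be preferred at scale:

.LOGON mytdpid/migration_user,password;
.EXPORT REPORT FILE = /landing/fact_sales.csv;
SELECT CAST(sale_id   AS VARCHAR(20)) || ',' ||
       CAST(sale_date AS VARCHAR(10)) || ',' ||
       CAST(amount    AS VARCHAR(30))
FROM ProdDB.fact_sales;
.EXPORT RESET;
.LOGOFF;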

A hybrid approach involving both methods is sometimes used: for example, use the direct network extract for the smaller dimension tables and for samples of the larger fact tables, to quickly provide a test environment in Azure Synapse, while using file extract and transfer via Azure Data Box for the large-volume historical fact tables.

Orchestrate from Teradata or Azure?

The recommended approach when moving to Azure Synapse is to orchestrate the data extract and loading from the Azure environment, using Azure Data Factory and associated utilities (e.g. PolyBase for the most efficient data loading). This approach leverages Azure capabilities and provides an easy method to build reusable data-loading pipelines.

Other benefits of this approach include reduced impact on the Teradata system during the data load process (as the management and loading processes run in Azure) and the ability to automate the process by using metadata-driven data load pipelines.

Which tools can be used?

The task of data transformation and movement is the basic function of all ETL products such as Informatica, and also of more modern data warehouse automation products such as WhereScape. If one of these products is already in use in the existing Teradata environment, using it may simplify the task of moving the data from Teradata to Azure Synapse. This assumes that the ETL tool supports Azure Synapse as a target environment (most modern tools do).

Even if there isn't an existing ETL tool in place, it is worth considering using one to simplify the migration task. Tools such as Attunity Replicate (see https://www.attunity.com/products/replicate/) are designed to simplify the task of data migration.

Finally, if using an ETL tool, consider running that tool within the Azure environment, as this benefits from Azure cloud performance, scalability and cost while also freeing up resources in the Teradata data center.


Summary

To summarize the recommendations when migrating data and associated ETL processes from Teradata to Azure Synapse:

• Planning is essential to ensure a successful migration exercise

• Build a detailed inventory of data and processes to be migrated as soon as possible

• Use system metadata and logfiles to get an accurate understanding of data and process usage (documentation may be out of date)

• Understand the data volumes to be migrated, and the network bandwidth between the on-premise data center and the Azure cloud environment

• Consider using a Teradata instance in an Azure VM as a 'stepping stone' to offload migration from the legacy Teradata environment

• Leverage standard 'built-in' Azure features where appropriate to minimize the migration workload

• Understand the most efficient tools for data extract and load in both the Teradata and Azure environments, and use the appropriate tools at each phase of the process

• Use Azure facilities such as Azure Data Factory to orchestrate and automate the migration process while minimizing impact on the Teradata system