Migration to Azure Synapse Analytics
TRANSCRIPT
Section 2.1 – Data migration, ETL and load for Teradata migrations
Copyright © Microsoft Corporation, 2019, All Rights Reserved
Table of contents
Context
Overview
Data migration considerations
  Initial decisions regarding data migration from Teradata
    Migrate unused tables?
    What’s the best migration approach to minimize risk and impact on users?
    Using a VM Teradata instance as part of a migration
    Migrating data marts – stay physical or go virtual?
  Data migration from Teradata
    Understand your data
    Teradata data type mapping
ETL migration considerations
  Initial decisions regarding Teradata ETL migration
  Re-engineering existing Teradata-specific scripts
  Using existing 3rd party ETL tools
    Ab Initio
    Attunity
    Informatica
    Pentaho
    Talend
    WhereScape
Data loading from Teradata
  Choices available when loading data from Teradata
    Transfer data via files or network connection?
    Orchestrate from Teradata or Azure?
    Which tools can be used?
Summary
Context
This paper is one of a series of documents that discuss aspects of migrating legacy
data warehouse implementations to Azure Synapse. The focus of this paper is on
data migration, ETL and loading specifically from existing Teradata environments;
other topics, such as the recommended migration approach and advanced analytics
in the data warehouse, are covered in separate documents. This document should be
read in conjunction with the ‘Section 2 – Data ETL and Load’ document, which
discusses the general aspects of design and performance for migrations to Azure
Synapse.
Overview
Existing users of Teradata data warehouse systems are now looking to take
advantage of the innovations provided by newer environments (e.g. cloud, IaaS,
PaaS) and to delegate tasks such as infrastructure maintenance and platform
development to the cloud provider.
While there are similarities between Teradata and Azure Synapse in that both are
SQL databases designed to use massively parallel processing (MPP) techniques to
achieve high query performance on very large data volumes, there are also some
basic differences in approach:
• Legacy Teradata systems are usually installed on-premises, using proprietary
hardware, whereas Azure Synapse is cloud-based, using Azure storage and
compute resources.
• Upgrading a Teradata configuration is a major task involving additional physical
hardware and a potentially lengthy database reconfiguration. Because storage and
compute resources are separate in the Azure environment, they can easily be
scaled up or down independently, leveraging elastic scalability.
• Azure Synapse can be paused or resized as required to reduce resource
utilization and therefore cost.
Microsoft Azure is a globally available, highly secure, scalable cloud environment
which includes Azure Synapse within an eco-system of supporting tools and
capabilities.
Azure Synapse provides best-of-breed relational database performance by using
techniques such as massively parallel processing (MPP) and automatic in-memory
caching. The results of this approach can be seen in independent benchmarks such
as the one run recently by GigaOm (see
https://gigaom.com/report/data-warehouse-cloud-benchmark/), which compares
Azure Synapse to other popular cloud data warehouse offerings. Customers who
have already migrated to this environment have seen many benefits, including:
• Improved performance and price/performance
• Increased agility and shorter time to value
• Faster server deployment and application development
• Elastic scalability – only pay for actual usage
• Improved security/compliance
• Reduced storage and Disaster Recovery costs
• Lower overall TCO and better cost control (OPEX)
To maximize these benefits, it is necessary to migrate existing (or new) data and
applications to the Azure Synapse platform, and in many organizations this will
include migration of an existing data warehouse from legacy on-premises platforms
such as Teradata or Netezza. At a high level, the basic process includes the
following steps:

[Figure: high-level migration process steps]
This paper looks at the data migration, ETL and loading aspects of migration from a
legacy Teradata data warehouse and data marts onto Azure Synapse. The topics
included in this paper apply specifically to migrations from an existing Teradata
environment.
Data migration considerations
Initial decisions regarding data migration from Teradata
When it comes to migrating a Teradata data warehouse, there are a few basic
questions associated with data that need to be asked. For example:
• Should unused table structures be migrated or not?
• What’s the best migration approach to minimize risk and impact for users?
• Migrating data marts – stay physical or go virtual?
The next sections discuss these points within the context of a migration from
Teradata.
Migrate unused tables?
It generally makes sense to migrate only the tables that are actually in use in the
existing system. Tables which are not active can be archived rather than migrated, so
that the data remains available if required in future. It’s best to use system metadata
and log files, rather than documentation, to determine which tables are in use, since
documentation may be out of date.
Teradata system catalog tables and logs contain information which can be used to
determine when a given table was last accessed – which in turn can be used to
decide whether or not a table is a candidate for migration.
A simple query on DBC.Tables can provide the date of last access and last
modification, for example:

SELECT TableName, CreatorName, CreateTimeStamp, LastAlterName,
       LastAlterTimeStamp, AccessCount, LastAccessTimeStamp
FROM   DBC.Tables t
WHERE  DataBaseName = 'databasename';
If logging has been enabled and the log history is accessible, much more
information, including SQL query text, is available in the table DBQLogTbl and the
associated logging tables. See
https://docs.teradata.com/reader/wada1XMYPkZVTqPKz2CNaw/PuQUxpyeCx4jvP8XCiEeGA
for more details.
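Once the last-access timestamps have been fetched from the catalog, the split into migration and archive candidates is a simple filter. The sketch below illustrates one way to do this in Python; the function name, the one-year cutoff and the input shape (name, last-access pairs) are illustrative assumptions, not part of any Teradata or Azure tooling:

```python
from datetime import datetime, timedelta

def classify_tables(rows, inactive_days=365, today=None):
    """Split catalog rows into migrate/archive candidate lists.

    rows: iterable of (table_name, last_access) tuples, where last_access is
    a datetime or None (never accessed), e.g. as fetched from DBC.Tables.
    """
    today = today or datetime.utcnow()
    cutoff = today - timedelta(days=inactive_days)
    migrate, archive = [], []
    for name, last_access in rows:
        # Tables never accessed, or not accessed within the window,
        # are archive candidates rather than migration candidates.
        if last_access is not None and last_access >= cutoff:
            migrate.append(name)
        else:
            archive.append(name)
    return migrate, archive
```

The cutoff should of course be agreed with the business; a table untouched for a year by queries may still be read once a year by a regulatory report.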
What’s the best migration approach to minimize risk and impact on
users?
This question comes up a lot, as companies often want to reduce the impact of
changes to the data warehouse data model in order to improve agility, and would
like to seize the opportunity of a migration to modernize that model. This approach
carries higher risk, as it would almost certainly impact the ETL jobs populating the
data warehouse as well as those feeding dependent data marts from it. Therefore it
may be better to undertake a re-design of this scale after the data warehouse
migration.
Even if a data model change is planned as part of the overall migration, it is good
practice to migrate the existing model ‘as-is’ to the new environment (Azure Synapse
in this case), then do any re-engineering on the new platform. This approach
minimizes the impact on existing production systems while leveraging the
performance and elastic scalability of the Azure platform for one-off re-engineering
tasks.
When migrating from Teradata, there is also the option of creating a Teradata
environment in a VM within Azure as a ‘stepping stone’ in the migration process.
Using a VM Teradata instance as part of a migration
One optional approach for running a migration from an on-premises Teradata
environment is to leverage the Azure environment, with its low-cost cloud storage
and elastic scalability, to create a Teradata instance within a VM in Azure, co-located
with the target Azure Synapse environment.
With this approach, standard Teradata utilities such as Teradata Parallel Data
Transporter (or 3rd party data replication tools such as Attunity Replicate) can be
used to efficiently move the subset of Teradata tables which are to be migrated onto
the VM instance, and then all migration tasks can take place within the Azure
environment. This approach has several benefits:
• After the initial replication of data, the source system is not impacted by the
migration tasks
• The familiar Teradata interfaces, tools and utilities are available within the Azure
environment
• Once the data is in the Azure environment, there are no potential issues with
network bandwidth availability between the on-premises source system and the
cloud target system
• Tools such as Azure Data Factory can efficiently call utilities such as Teradata
Parallel Transporter to migrate data quickly and easily
• The migration process is orchestrated and controlled entirely within the Azure
environment
Migrating data marts – stay physical or go virtual?
In legacy Teradata data warehouse environments it is common practice to create a
number of data marts that are structured to provide good performance for ad hoc
self-service queries and reports for a given department or business function within
an organization. As such, a data mart typically consists of a subset of the data
warehouse containing aggregated versions of the data in a form that enables users
to easily query that data with fast response times via user-friendly query tools such
as Microsoft Power BI, Tableau or Microstrategy. This form is generally a dimensional
data model, and one use of data marts is to expose the data in a usable form even if
the underlying warehouse data model is something different (e.g. data vault).
Separate data marts for individual business units within an organization can also be
used to implement robust data security regimes, by only allowing user access to
specific data marts relevant to them, and eliminating, obfuscating or anonymizing
sensitive data.
If these data marts are implemented as physical tables, they require additional
storage resources and additional processing to build and refresh them on a regular
basis. It also means that the data in the mart is only as up to date as the last refresh
operation, so it may not be suitable for highly volatile data dashboards.
With the advent of relatively low-cost, scalable MPP architectures such as Azure
Synapse, and their inherent performance characteristics, data mart functionality can
often be provided without instantiating the mart as a set of physical tables. This is
achieved by effectively virtualizing the data marts via SQL views onto the main data
warehouse, or via a virtualization layer using features such as views in Azure or 3rd
party virtualization products such as Denodo. This approach simplifies or eliminates
the need for additional storage and aggregation processing, and reduces the overall
number of database objects to be migrated.
There is also another potential benefit of this approach: by implementing the
aggregation and join logic within a virtualization layer, and presenting external
reporting tools with a virtualized view, the processing required to create these views
is ‘pushed down’ into the data warehouse, which is generally the best place to run
joins, aggregations and similar operations on large data volumes.
The primary drivers for choosing a virtual rather than a physical data mart
implementation are:
• More agility – a virtual data mart is easier to change than physical tables and
the associated ETL processes
• Lower total cost of ownership – fewer data stores and copies of data in a
virtualized implementation
• Elimination of ETL jobs to migrate, and a simplified data warehouse architecture
in a virtualized environment
• Performance – historically physical data marts have been more performant,
though virtualization products are now implementing intelligent caching
techniques to mitigate this
Data migration from Teradata
Understand your data
Part of the migration planning should be to understand in detail the volume of data
to be migrated as this can impact decisions on the migration approach to take. Use
system metadata to determine the physical space taken up by the ‘raw data’ within
the tables to be migrated. In this context ‘raw data’ means the amount of space used
by the data rows within a table excluding overheads such as indexes and any
compression. This is especially true for the largest fact tables as these will typically
comprise > 95% of the data.
A good way to get an accurate number for this for a given table is to extract a
representative sample of the data (e.g. 1 million rows) to an uncompressed delimited
flat ASCII data file and use the size of that to give an average raw data size per row
of that table. Multiply this average size by the total number of rows in the full table
to give a raw data size for that table and use this figure in planning.
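The extrapolation described above is simple arithmetic; a minimal sketch follows. The function name and its three inputs (sample extract size in bytes, sample row count, full table row count) are illustrative assumptions:

```python
def estimate_raw_size(sample_file_bytes, sample_rows, total_rows):
    """Estimate the uncompressed 'raw data' size of a table.

    sample_file_bytes: size of the delimited flat-file extract of the sample
    sample_rows:       number of rows in that sample (e.g. 1,000,000)
    total_rows:        total row count of the full table
    Returns the estimated raw size of the whole table in bytes.
    """
    avg_row_bytes = sample_file_bytes / sample_rows
    return int(avg_row_bytes * total_rows)

# For example, a 1M-row sample extracted to a 120 MB flat file implies
# ~120 bytes per row; a 2.5 billion row fact table would then be ~300 GB raw.
```

Summing this estimate across the largest fact tables gives the planning figure for data transfer volumes.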
Teradata data type mapping
Some Teradata data types are not directly supported in Azure Synapse. The table
below shows these data types together with the recommended approach for
handling them. In the table, the Teradata Column Type is the type code stored in the
system catalog (e.g. in DBC.ColumnsV).
Teradata Column Type | Teradata Data Type | Azure Synapse Data Type
++ | TD_ANYTYPE | Not supported in Azure Synapse
A1 | ARRAY | Not supported in Azure Synapse
AN | ARRAY | Not supported in Azure Synapse
AT | TIME | TIME
BF | BYTE | BINARY
BO | BLOB | BLOB isn't directly supported but can be replaced with BINARY
BV | VARBYTE | BINARY
CF | CHAR | VARCHAR
CO | CLOB | CLOB isn't directly supported but can be replaced with VARCHAR
CV | VARCHAR | VARCHAR
D | DECIMAL | DECIMAL
DA | DATE | DATE
DH | INTERVAL DAY TO HOUR | INTERVAL data types aren't supported in Azure Synapse; date calculations can be done with the date comparison functions (e.g. DATEDIFF and DATEADD)
DM | INTERVAL DAY TO MINUTE | As above (INTERVAL not supported)
DS | INTERVAL DAY TO SECOND | As above (INTERVAL not supported)
DT | DATASET | DATASET isn't supported in Azure Synapse
DY | INTERVAL DAY | As above (INTERVAL not supported)
F | FLOAT | FLOAT
HM | INTERVAL HOUR TO MINUTE | As above (INTERVAL not supported)
HR | INTERVAL HOUR | As above (INTERVAL not supported)
HS | INTERVAL HOUR TO SECOND | As above (INTERVAL not supported)
I1 | BYTEINT | TINYINT
I2 | SMALLINT | SMALLINT
I8 | BIGINT | BIGINT
I | INTEGER | INT
JN | JSON | JSON isn't currently directly supported in Azure Synapse, but JSON data can be stored in a VARCHAR column
MI | INTERVAL MINUTE | As above (INTERVAL not supported)
MO | INTERVAL MONTH | As above (INTERVAL not supported)
MS | INTERVAL MINUTE TO SECOND | As above (INTERVAL not supported)
N | NUMBER | NUMERIC
PD | PERIOD(DATE) | Can be converted to VARCHAR or split into 2 separate dates
PM | PERIOD(TIMESTAMP WITH TIME ZONE) | Can be converted to VARCHAR or split into 2 separate timestamps (DATETIMEOFFSET)
PS | PERIOD(TIMESTAMP) | Can be converted to VARCHAR or split into 2 separate timestamps (DATETIMEOFFSET)
PT | PERIOD(TIME) | Can be converted to VARCHAR or split into 2 separate times
PZ | PERIOD(TIME WITH TIME ZONE) | Can be converted to VARCHAR or split into 2 separate times, but WITH TIME ZONE isn't supported for TIME
SC | INTERVAL SECOND | As above (INTERVAL not supported)
SZ | TIMESTAMP WITH TIME ZONE | DATETIMEOFFSET
TS | TIMESTAMP | DATETIME or DATETIME2
TZ | TIME WITH TIME ZONE | TIME WITH TIME ZONE isn't supported because TIME is stored using "wall clock" time only, without a time zone offset
XM | XML | XML isn't currently directly supported in Azure Synapse, but XML data can be stored in a VARCHAR column
YM | INTERVAL YEAR TO MONTH | As above (INTERVAL not supported)
YR | INTERVAL YEAR | As above (INTERVAL not supported)
Use the metadata from the Teradata catalog tables to determine whether any of
these data types are to be migrated, and allow for this in the migration plan. For
example, a SQL query such as the one below can be used to find any occurrences of
unsupported data types which need attention:
SELECT ColumnType,
       CASE ColumnType
            WHEN '++' THEN 'TD_ANYTYPE'
            WHEN 'A1' THEN 'ARRAY'
            WHEN 'AN' THEN 'ARRAY'
            WHEN 'BO' THEN 'BLOB'
            WHEN 'CO' THEN 'CLOB'
            WHEN 'DH' THEN 'INTERVAL DAY TO HOUR'
            WHEN 'DM' THEN 'INTERVAL DAY TO MINUTE'
            WHEN 'DS' THEN 'INTERVAL DAY TO SECOND'
            WHEN 'DT' THEN 'DATASET'
            WHEN 'DY' THEN 'INTERVAL DAY'
            WHEN 'HM' THEN 'INTERVAL HOUR TO MINUTE'
            WHEN 'HR' THEN 'INTERVAL HOUR'
            WHEN 'HS' THEN 'INTERVAL HOUR TO SECOND'
            WHEN 'JN' THEN 'JSON'
            WHEN 'MI' THEN 'INTERVAL MINUTE'
            WHEN 'MO' THEN 'INTERVAL MONTH'
            WHEN 'MS' THEN 'INTERVAL MINUTE TO SECOND'
            WHEN 'PD' THEN 'PERIOD(DATE)'
            WHEN 'PM' THEN 'PERIOD(TIMESTAMP WITH TIME ZONE)'
            WHEN 'PS' THEN 'PERIOD(TIMESTAMP)'
            WHEN 'PT' THEN 'PERIOD(TIME)'
            WHEN 'PZ' THEN 'PERIOD(TIME WITH TIME ZONE)'
            WHEN 'SC' THEN 'INTERVAL SECOND'
            WHEN 'SZ' THEN 'TIMESTAMP WITH TIME ZONE'
            WHEN 'XM' THEN 'XML'
            WHEN 'YM' THEN 'INTERVAL YEAR TO MONTH'
            WHEN 'YR' THEN 'INTERVAL YEAR'
       END AS Data_Type,
       COUNT(*) AS Data_Type_Count
FROM   DBC.ColumnsV
WHERE  DatabaseName IN ('UserDB1', 'UserDB2', 'UserDB3')  -- select databases to be migrated
GROUP BY 1, 2
ORDER BY 1;
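If the migration tooling is scripted rather than driven by a 3rd party product, the mapping table above can be encoded directly. The sketch below is a partial, illustrative Python encoding (the dictionary, function name and the needs-review flag are assumptions for this sketch, not an official mapping artifact); each entry pairs a suggested Synapse type with a flag indicating whether manual review is needed:

```python
# Partial mapping of Teradata column-type codes (as stored in DBC.ColumnsV)
# to suggested Azure Synapse types, following the table above. Entries with
# True need manual review because there is no direct Synapse equivalent.
TYPE_MAP = {
    "AT": ("TIME", False),
    "BF": ("BINARY", False),
    "BO": ("BINARY", True),    # BLOB: no direct support
    "BV": ("BINARY", False),
    "CF": ("VARCHAR", False),
    "CO": ("VARCHAR", True),   # CLOB: no direct support
    "CV": ("VARCHAR", False),
    "D":  ("DECIMAL", False),
    "DA": ("DATE", False),
    "I1": ("TINYINT", False),
    "I2": ("SMALLINT", False),
    "I8": ("BIGINT", False),
    "I":  ("INT", False),
    "JN": ("VARCHAR", True),   # JSON stored as text
    "N":  ("NUMERIC", False),
    "SZ": ("DATETIMEOFFSET", False),
    "TS": ("DATETIME2", False),
    "XM": ("VARCHAR", True),   # XML stored as text
}

def map_column_type(code):
    """Return (synapse_type, needs_review) for a Teradata column-type code.

    Codes not in the map (e.g. the INTERVAL and PERIOD families) have no
    direct Synapse equivalent and always need manual attention.
    """
    return TYPE_MAP.get(code, (None, True))
```

Running the catalog query above and feeding each ColumnType code through such a mapping gives a quick count of how many columns can be converted mechanically versus how many need design decisions.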
There are 3rd party vendors who offer tools and services to automate migration
including the mapping of data types as described above. Also, if a 3rd party ETL tool
such as Informatica or Talend is already in use in the Teradata environment, these
can implement any required data transformations. The next section explores
migration of existing 3rd party ETL processes.
ETL migration considerations
Initial decisions regarding Teradata ETL migration
For ETL/ELT processing, legacy Teradata data warehouses may use custom-built
scripts using Teradata utilities such as BTEQ and Teradata Parallel Transporter (TPT),
or a 3rd party ETL tool such as Informatica or Ab Initio. Sometimes there is a
combination of both approaches that has evolved over time. When planning a
migration to Azure Synapse, the question is how best to implement the required
ETL/ELT processing in the new environment, while minimizing the cost and risk
involved.
The sections below discuss the options available and make some recommendations
for the various use cases. One way to decide on the approach is summarized in the
flowchart below:

[Figure: ETL migration decision flowchart – decisions 1 to 4]
The initial step should always be to build an inventory of ETL/ELT processes to be
migrated – again, as with other steps, it may be that standard ‘built-in’ Azure
features mean that some existing processes need not be migrated. For planning
purposes it is important to understand the scale of the migration to be performed.
In the flowchart above, decision 1 relates to the high-level question of whether there
has already been a decision to move to a totally Azure-native environment. If so,
then the recommendation is to re-engineer the ETL processing using Azure Data
Factory (ADF) and associated utilities.
If that is not the case, then decision 2 is whether an existing 3rd party ETL tool is
already in use. In the Teradata environment, some (or all) of the ETL processing may
be performed by custom scripts using Teradata-specific utilities such as BTEQ and
TPT; if no 3rd party tool is in use, the approach is again to re-engineer using ADF.
If a 3rd party ETL tool such as Informatica or Ab Initio is already in use, (and
especially if there is a large investment in skills and a large number of existing
workflows and schedules in place using that tool) then decision 3 is based on
whether the tool can efficiently support Azure Synapse as a target environment.
Ideally the tool will include ‘native’ connectors which can leverage Azure facilities
such as PolyBase for the most efficient parallel data loading, but even if these are
not in place there is generally a way of calling an external process (e.g. PolyBase)
and passing the appropriate parameters. In this case the existing skills and workflows
can be leveraged, with Azure Synapse becoming the new target environment.
If retaining an existing 3rd party ETL tool, there may be benefits to running that tool
within the Azure environment (rather than an existing on-premise ETL server) and
also that the overall orchestration of the existing workflows could be handled by
Azure Data Factory. So decision 4 is whether to leave the existing tool running ‘as-is’
or to move it into the Azure environment to gain cost, performance and scalability
benefits.
Re-engineering existing Teradata-specific scripts
If some or all of the existing Teradata warehouse ETL/ELT processing is handled by
custom scripts which utilize Teradata-specific utilities such as BTEQ, MLOAD or TPT
then these need to be re-coded for the new Azure Synapse environment. Similarly, if
ETL processes have been implemented using stored procedures in Teradata, these
will also have to be recoded.
Some elements of the ETL process are relatively easy to migrate (e.g. simple bulk
data loads into a staging table from an external file), and it may even be possible to
automate these parts of the process, for example by using PolyBase instead of
FastLoad or MLOAD. Other parts of the process which contain arbitrarily complex
SQL and/or stored procedures will take more time to re-engineer.
One way of testing Teradata SQL for compatibility with Azure Synapse is to capture
some representative SQL statements from the Teradata logs, prefix each query with
EXPLAIN and then (assuming a like-for-like migrated data model in Azure Synapse)
run those EXPLAIN statements in Azure Synapse. Any incompatible SQL will return
an error, and this information can be used to determine the scale of the re-coding
task.
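The EXPLAIN technique above is easy to script. The sketch below only builds the EXPLAIN statements from a list of captured queries; executing them against Azure Synapse (e.g. via an ODBC connection) and logging which ones error out is left to the surrounding harness. The function name is an assumption for this sketch:

```python
def build_explain_checks(queries):
    """Wrap captured Teradata SQL statements in EXPLAIN for a dry run
    against Azure Synapse. Running the returned statements surfaces
    incompatible SQL as errors without actually executing the queries.
    """
    checks = []
    for q in queries:
        q = q.strip().rstrip(";")
        if not q:
            continue  # skip blank capture lines
        checks.append(f"EXPLAIN {q};")
    return checks
```

Counting the failures from such a run gives an early, data-driven estimate of the re-coding effort before any manual rework starts.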
In the worst case this will mean manual re-coding, but there are also products and
services available from Microsoft partners to assist with this process. For example,
Datometry Hyper-Q (see https://datometry.com/) runs between the legacy
applications and Azure Synapse to translate Teradata queries into Azure Synapse
T-SQL at the network layer.
Ispirer also offers tools and services to migrate Teradata SQL, stored procedures
and other objects to Azure Synapse – see
https://www.ispirer.com/products/teradata-to-azure-sql-data-warehouse-migration
Using existing 3rd party ETL tools
As described in the section above, in many cases the existing legacy data warehouse
system will already be populated and maintained by a 3rd party ETL product such as
Informatica or Talend. See
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-partner-data-integration
for a list of current Microsoft data integration partners for Azure Synapse.
There are several popular ETL products which are frequently used in the Teradata
community (some of which are already Microsoft partners listed at the link above).
The following paragraphs discuss the most popular ETL tools currently in use with
Teradata warehouses. All of these products can be run within a VM in Azure, and can
read and write Azure databases and files.
Ab Initio
Ab Initio is a Business Intelligence platform comprising six data processing
products: Co>Operating System, The Component Library, Graphical Development
Environment, Enterprise Meta>Environment, Data Profiler, and Conduct>It. It is a
powerful GUI-based parallel processing tool for ETL data management and analysis.
Ab Initio is popular in Teradata environments as it can parse Teradata BTEQ scripts
automatically when they are executed using the “Execute SQL” component, which
allows the use of Ab Initio as a primary execution engine for an 'extract, transform,
load and transform' (ETLT) approach.
Ab Initio can run in a VM within Azure and can read and write Azure Storage for files
and SQL Server databases via the ‘Input Table’, ‘Output Table’ and ‘Run SQL’ Ab
Initio components.
Attunity
Attunity CloudBeam for Azure Synapse enables automated and optimized data
loading from many enterprise databases into Azure Synapse – quickly, easily and
affordably. It is available in the Microsoft Azure Marketplace – see
https://www.attunity.com/products/cloudbeam/attunity-cloudbeam-azure/ for more
details.
Attunity Replicate for Microsoft Migrations is for Microsoft customers who want to
migrate data from popular commercial and open-source databases to the Microsoft
Data Platform, including from Teradata to Azure Synapse. It can be obtained from
https://www.attunity.com/products/replicate/attunity-replicate-for-microsoft-migration/.
Attunity benefits for Azure Synapse include:
• Continuous database to Azure Synapse loading
• Quick transfer speeds with guaranteed delivery
• Intuitive administration and scheduling
• Data integrity assurance by way of check mechanisms
• Monitoring for peace-of-mind, control, and auditing
• Industry-standard SSL encryption for security
Informatica
Informatica (see https://www.informatica.com/gb/) has two offerings which are
available in the Azure Marketplace:
Informatica Cloud Services for Azure offers a best-in-class solution for
self-service data migration, integration, and management capabilities.
Customers can quickly and reliably import and export petabytes of data to
Azure from a variety of sources. Informatica Cloud Services for Azure
provides native, high-volume, high-performance connectivity to Azure
Synapse, SQL Database, Blob Storage, Data Lake Store, and Azure Cosmos
DB.
Informatica PowerCenter is a metadata-driven data integration platform
that jumpstarts and accelerates data integration projects in order to deliver
data to the business more quickly than manual hand coding. It serves as the
foundation for your data integration investments.
Pentaho
Pentaho is a business intelligence (BI) platform that provides data integration, OLAP
services, reporting, information dashboards, data mining and extract, transform, load
(ETL) capabilities.
Pentaho Data Integration (PDI) provides the Extract, Transform, and Load (ETL)
capabilities that facilitate the process of capturing, cleansing, and storing data in
a uniform and consistent format that is accessible and relevant to end users and IoT
technologies.
Common uses of Pentaho Data Integration include:
• Data migration between different databases and applications
• Loading huge data sets into databases taking full advantage of cloud, clustered
and massively parallel processing environments
• Data Cleansing with steps ranging from very simple to very complex
transformations
• Data Integration including the ability to leverage real-time ETL as a data source
for Pentaho Reporting
• Data warehouse population with built-in support for slowly changing dimensions
and surrogate key creation
It is available in Azure Marketplace and has connectors available for Azure services
such as HDInsight. See https://www.ashnik.com/pentaho-cloud-deployment-with-
microsoft-azure/ for more details.
Talend
Talend Cloud is a unified, comprehensive, and highly scalable integration platform-
as-a-service (iPaaS) that makes it easy to collect, govern, transform, and share data.
Within a single interface, you can use Big Data integration, Data Preparation, API
Services and Data Stewardship applications to provide trusted, governed data across
your organization. It offers over 900 connectors and components, built-in data
quality, and native support for the latest big data and cloud technologies, and
software development lifecycle (SDLC) support for enterprises, at a predictable price.
With just a few clicks, you can deploy the remote engine to run integration tasks
natively with your Azure account from cloud to cloud, on-premises to cloud or cloud
to on-premises, completely within the customer’s environment for enhanced
performance and security. See https://www.talend.com/solutions/information-
technology/azure-cloud-integration/ for more information.
Talend can leverage Azure features such as PolyBase to enable efficient
data loading into Azure Synapse.
See https://www.talend.com/blog/2017/02/08/leverage-load-data-microsoft-azure-
sql-data-warehouse-using-polybase-talend-etl/ for details.
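PolyBase loads data in parallel from files in Azure Blob Storage via external tables, which is why ETL tools target it for bulk loads into Azure Synapse. As an illustrative sketch of the pattern such a tool generates behind the scenes (all object names, the data source, and the file format below are hypothetical placeholders that would need to be created first), the Python below simply assembles the external-table and CTAS statements from basic table metadata:

```python
# Sketch: build the PolyBase load statements (external table + CTAS) that an
# ETL tool typically generates when loading Azure Synapse. The data source,
# file format and table names are hypothetical placeholders, not real objects.

def polybase_load_sql(table: str, columns: list[str], blob_dir: str) -> str:
    """Return T-SQL to load one table from extracted flat files via PolyBase."""
    col_defs = ",\n    ".join(f"{c} NVARCHAR(4000)" for c in columns)
    return f"""
CREATE EXTERNAL TABLE ext_{table} (
    {col_defs}
)
WITH (
    LOCATION = '{blob_dir}',        -- folder of extracted CSV files
    DATA_SOURCE = AzureBlobStage,   -- assumed pre-created external data source
    FILE_FORMAT = GzipCsvFormat     -- assumed pre-created file format
);

CREATE TABLE {table}
WITH (DISTRIBUTION = ROUND_ROBIN)   -- consider HASH(col) for large fact tables
AS SELECT * FROM ext_{table};
""".strip()

sql = polybase_load_sql("sales_fact",
                        ["sale_id", "sale_date", "amount"],
                        "/teradata/sales_fact/")
print(sql)
```

Generating these statements from metadata (rather than hand-writing them per table) is what makes tool-driven loading repeatable across hundreds of migrated tables.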
WhereScape
WhereScape® RED automation software is an integrated development environment
that gives teams the automation to streamline workflows, eliminate hand-coding,
and cut the time to develop, deploy and operate data infrastructure, such as data
warehouses, data vaults, data marts and data lakes, by as much as 80%.
WhereScape automation is tailored for use with Microsoft SQL Server, Microsoft
Azure SQL Database, Microsoft Azure Synapse and the Microsoft Analytics Platform
System (PDW). See https://www.wherescape.com for full details.
Data loading from Teradata
Choices available when loading data from Teradata
When it comes to migrating the data from a Teradata data warehouse, there are a
few basic questions associated with data loading that need to be resolved. These
involve deciding how the data will be physically moved from the existing on-premises
Teradata environment into the new Azure Synapse environment in the cloud, and
which tools will be used to perform the transfer and load.
• Will the data be extracted to files or moved directly via network?
• Will the process be orchestrated from the source system or from the Azure
target environment?
• Which tools can be used to automate and manage the process?
Transfer data via files or network connection?
Once the database tables to be migrated have been created in Azure Synapse, the
data to populate those tables must be moved out of the legacy Teradata system and
loaded into the new environment. There are two basic approaches:
• File extract – In this case the data from the Teradata tables is extracted to flat
files (normally in comma-separated values (CSV) format) via BTEQ, FastExport
or Teradata Parallel Transporter (TPT). TPT should be used where possible as it is
the most efficient in terms of data throughput.
This approach requires space to ‘land’ the data files that are extracted – this
space could be ‘local’ to the Teradata source database (if sufficient storage is
available) or remotely in Azure Blob Storage. The best performance is generally
achieved when the file is written locally, avoiding any network overhead.
To minimize the storage and network transfer requirements, it is good practice
to compress the extracted data files using a utility such as gzip.
Once extracted, the flat files can either be moved into Azure Blob Storage
(co-located with the target Azure Synapse instance) or loaded directly into Azure
Synapse via PolyBase. The method of physically moving the data from local on-
premises storage to the Azure cloud environment depends on the amount of data
to be moved and the network bandwidth available.
Microsoft provides various options to move large volumes of data, including
AzCopy (for moving files across the network into Azure Storage), Azure
ExpressRoute (for moving bulk data over a private network connection), and Azure
Data Box, where the files are moved to a physical storage device which is then
shipped to an Azure data center for loading. See https://docs.microsoft.com/en-
us/azure/architecture/data-guide/scenarios/data-transfer for more details.
• Direct extract and load across network – In this case, the target Azure
environment sends a data extract request (normally via a SQL command) to the
legacy Teradata system to extract the data and the results are sent across the
network and loaded directly into Azure Synapse, with no need to ‘land’ the data
into intermediate files. The limiting factor in this scenario is normally the
bandwidth of the network connection between the Teradata database and the
Azure environment. For very large data volumes this approach may not be
practical.
A hybrid approach involving both methods is sometimes used – for example, using
the direct network extract approach for smaller dimension tables and samples of the
larger fact tables to quickly provide a test environment in Azure Synapse, while using
file extract and transfer via Azure Data Box for the high-volume historical fact tables.
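The choice between network transfer and a shipped Azure Data Box mostly comes down to raw data volume, achievable compression, and available bandwidth. A rough back-of-envelope estimator can frame that decision; the 3x compression ratio and the bandwidth figure below are illustrative assumptions, not measurements:

```python
# Sketch: estimate the network transfer time for extracted, compressed files
# to help choose between network transfer (AzCopy/ExpressRoute) and Azure
# Data Box. The compression ratio and bandwidth values are assumptions only.

def transfer_days(raw_tb: float, bandwidth_mbps: float,
                  compression_ratio: float = 3.0) -> float:
    """Days to move raw_tb terabytes over bandwidth_mbps after compression."""
    compressed_bits = raw_tb * 1e12 * 8 / compression_ratio  # TB -> bits
    seconds = compressed_bits / (bandwidth_mbps * 1e6)       # Mbps -> bps
    return seconds / 86400

# 50 TB of raw Teradata data over a 1 Gbps link, assuming ~3x gzip compression
days = transfer_days(50, 1000)
print(f"{days:.1f} days")  # if this exceeds your window, consider Azure Data Box
```

In practice the estimate should use measured compression ratios from a sample extract and the sustained (not nominal) bandwidth of the link, since both vary widely by data profile and network.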
Orchestrate from Teradata or Azure?
The recommended approach when moving to Azure Synapse is to orchestrate the
data extract and loading from the Azure environment using Azure Data Factory and
associated utilities (e.g. PolyBase for most efficient data loading). This approach
leverages the Azure capabilities and provides an easy method to build reusable data
loading pipelines.
Other benefits of this approach include reduced impact on the Teradata system
during the data load process (as the management and loading process is running in
Azure) and the ability to automate the process by using metadata-driven data load
pipelines.
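A metadata-driven approach means the load pipelines are generated from a table inventory rather than hand-built per table. As an illustrative sketch only (the structure below is deliberately simplified and is not the full Azure Data Factory Copy activity JSON schema; table names are hypothetical):

```python
import json

# Sketch: generate simplified copy-activity definitions from a table inventory,
# the core idea behind metadata-driven load pipelines. The shape below is
# illustrative, not the real Azure Data Factory pipeline schema.

tables = [
    {"name": "customer_dim", "schema": "dbo"},
    {"name": "sales_fact",   "schema": "dbo"},
]

def copy_activity(table: dict) -> dict:
    """One copy activity: Teradata source table -> Azure Synapse target."""
    return {
        "name": f"Copy_{table['name']}",
        "type": "Copy",
        "source": {"type": "TeradataSource",
                   "query": f"SELECT * FROM {table['schema']}.{table['name']}"},
        "sink": {"type": "SqlDWSink", "allowPolyBase": True},
    }

pipeline = {"name": "TeradataToSynapse",
            "activities": [copy_activity(t) for t in tables]}
print(json.dumps(pipeline, indent=2))
```

Because the activities are derived from metadata, adding a table to the migration becomes a one-line inventory change rather than new pipeline development, which is what makes the approach scale to large schemas.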
Which tools can be used?
The task of data transformation and movement is the basic function of all ETL
products such as Informatica, and also of more modern data warehouse automation
products such as WhereScape. If one of these products is already in use in the
existing Teradata environment, then the migration task of moving the data from
Teradata to Azure Synapse may be simplified by using the existing ETL tool. This
assumes that the ETL tool supports Azure Synapse as a target environment (most
modern tools do).
Even if there isn’t an existing ETL tool in place, it is worth considering using a tool to
simplify the migration task. Tools such as Attunity Replicate (see
https://www.attunity.com/products/replicate/ ) are designed to simplify the task of
data migration.
Finally, if using an ETL tool, consider running that tool within the Azure environment
as this will benefit from Azure cloud performance, scalability and cost while also
freeing up resources in the Teradata data center.
Summary
To summarize the recommendations when migrating data and associated ETL
processes from Teradata to Azure Synapse:
• Planning is essential to ensure a successful migration exercise
• Build a detailed inventory of data and processes to be migrated as soon as
possible
• Use system metadata and log files to get an accurate understanding of data and
process usage (documentation may be out of date)
• Understand the data volumes to be migrated, and also the network bandwidth
between the on-premises data center and the Azure cloud environment
• Consider using a Teradata instance in an Azure VM as a ‘stepping stone’ to
offload migration from the legacy Teradata environment
• Leverage standard ‘built-in’ Azure features where appropriate to minimize the
migration workload
• Understand the most efficient tools for data extract and load etc. in both
Teradata and Azure environments and use the appropriate tools at each phase of
the process
• Use Azure facilities such as Azure Data Factory to orchestrate and automate the
migration process while minimizing impact on the Teradata system