ssis best practices
TRANSCRIPT
SSIS – Best Practices Introduction
About This Document
This document outlines some best practices for developing ETL solutions in SQL Server Integration
Services (SSIS). This document has leveraged best practices from multiple sources including technical
papers, blogs and experts in the tool.
The purpose of establishing best practices will provide the following benefits:
Reusable and flexible ETL architecture solution
Solutions to common problems
Tips and tricks
Audience
The primary audience for this document is developers who will be involved in the detailed design and
implementation of the ETL solution. Secondary audiences are the architects who will be involved in
validating the guiding principles used in the design and implementation.
Pre-Requisites
The audience should be familiar with SSIS.
Scope
The scope of this document covers the various best practices to be followed when developing ETL
solutions in SSIS.
This document does not include:
Introduction to ETL, Business Intelligence or Data Warehouse concepts
Best practices for an ETL solution (e.g. architecture, logging, auditing etc.)
Introduction to SSIS
How to use a specific component of SSIS
Package Versions
This document uses:
Microsoft SQL Server Integration Services Designer version 9.00.3033.00
Microsoft Visual Studio 2005 Version 8.0.50727.762
Microsoft .NET Framework Version 2.0.50727 SP1
Installed Edition: IDE Standard Microsoft Visual Studio 2005 Premier Partner Edition - ENU
Service Pack 1 (KB926601)
Architecture and Setup
Master-Child Packages
In most cases, the problem is large enough to separate the solution into multiple packages based on a
logical property or module (e.g. dimensions and facts, line of business, helper packages etc.). Having
multiple packages will also be efficient as developers can work in parallel.
The master package should have minimal to none business extraction logic. Its main purpose is to call
other packages based on dependencies of the data (dimensions before facts). The master package
would also be responsible for cycle run date range, auditing, global variable values, database
connection strings, error recovery etc. to pass to the child packages.
Master package example below:
Templates
Use templates to standardize package design, configuration, event handling and auditing.
An important item to remember when using a template package to create a new package is to generate
a new ID for the package. A new ID can be generated from the Properties window of the package. The
property is under Identification->ID->Generate New ID. SSIS may not log the package execution
correctly if they are sharing the same IDs.
To read more about the benefits of using templates, visit
http://blogs.conchango.com/jamiethomson/archive/2007/03/11/SSIS_3A00_-Package-Template.aspx
Configurations
Package Configurations
Package configurations are a mechanism for dynamically changing properties of your IS objects and
components at run-time using values that are stored externally to the package.
Properties for variables (e.g. values, name, read-only, expression etc.), event handlers (e.g. error
handling, system error variables etc.), connection managers (connection string, security parameters
etc.), logging (e.g. events, description) and other general (e.g. checkpoint usage, failure recovery,
version control, access etc.) are easily configurable in SSIS by using the package configuration wizard.
During run-time, SSIS applies the configurations to the package based on the method and its order
created in the wizard.
SSIS provides a number of methods to store and pass these values to a package:
SQL Server
XML file
Environment variable
Registry settings
Parent Package Variable
Each method should be used as described below.
SQL Server
This is a form of a direct configuration method. It is the recommended way to store most of your
package configuration values. SSIS stores the values in a table with a default schema at a package,
property and value grain.
The developer has flexibility and control over the configurations as it can easily modify the configuration
values prior to run-time (e.g. date range for which the package should run for). The package itself has
control over the configurations during run-time as it can easily modify the property values (e.g. the
global error count variable whenever an error is encountered by any packages in the solution) with an
update to the configuration table.
Use a variable to store the connection string to the configuration table database for a connection
manager. You would then only need to update this variable value by using an indirect configuration
method to specify the table connection properties.
By storing values in a SQL Server table, we have a central repository to review and modify variable
values and package properties without performing the tedious task of opening the package, modifying
the values or property within the package and re-deploying the package.
XML File
An effective direct method for storing configurations as well but more complicated to modify
configurations during run-time. Each package of a solution can have its configurations in one file
allowing a central location to change package properties.
One can also use the XML configuration file as an indirect configuration (storing the direct configuration
table connection value) by setting the file location as a static configuration entry in the package
configuration wizard. You will need to ensure that when deploying the solution to different
environments, you must deploy the configuration file in the same structure in those environments.
Environment Variable
Leverage indirect configuration by using it to store the XML configuration file location or the SQL Server
connection of the configuration table. This will allow you to have and easily switch between multiple
configuration sets by changing the environment variable value rather than modifying it at the package
level and re-building the package. Also, when deploying solutions to different environments, only the
environment variable would need to be changed to identify the new location or database connection for
the direct configuration.
Registry Settings
Similar usage as an environment variable except the direct configuration value is set in a registry key.
Parent Package Variable
Use it to pass global variable values from the parent package to the child package. Use it to pass
database connection string values, cycle run ID, audit data and other useful information across
packages. Ensure that you create a variable with the same data type in the child package else data
value compatibility errors may arise.
Configurable Packages
As read above, SSIS allows multiple ways to dynamically configure packages. This is a powerful
feature that can be used in multiple ways to make a package extensible and flexible. One should store
most of the values of variables used and expression property of a variable in the SQL Server
configuration table. Additionally, store the source queries used to extract data in a variable and
leverage the use of expressions to dynamically generate a query for a set of criteria. An example is
provided in the DynamicQuery package of the SSIS_Best_Practices solution.
Recommended Reading
SSIS Package Configuration in SQL Server 2005
http://www.sql-server-performance.com/articles/dba/package_configuration_2005_p1.aspx
Connection Managers For an overview of the plethora of connection managers available in SSIS, review the MS Connection
Manager white paper.
Native Providers
Experience tells us to use native providers. However, in the world of SSIS, native providers are usually
not the most efficient method to use. The white paper ‘SSIS 2008 Connectivity Options’ highlights
some of the third-party options available. You may need to install additional client connectors to
effectively connect to a database from SSIS (view the frequent error you may receive to connect to an
Oracle DB below).
Oracle
Oracle’s own provider for an Oracle DB is quite slow and inefficient for extracting data. The better
option is using the SSIS’ provider with a few performance tweaks to the registry of the client that will
host the packages.
In the registry structure - HKEY_LOCAL_MACHINE\SOFTWARE\ORACLE\OLEDB, change the values
of the following keys: FetchSize to ‘200’, CacheType to ‘File’, PLSQLRSet to‘1’.
For writing to an Oracle DB, the provider from Persistent has great reviews and is supposedly 100
times faster than other providers. Visit
http://www.persistentsys.com/ServiceOfferings/TimeToMarketAccelerators/Connectors/SSISOracleBulk
LoadConnector/tabid/195/Default.aspx for more information.
Microsoft has released new Oracle providers (included in SSIS 2008) which are supposedly much
faster and efficient than their previous counterparts. You can download them here:
http://www.microsoft.com/downloads/details.aspx?FamilyID=d9cb21fe-32e9-4d34-a381-
6f9231d84f1e&DisplayLang=en
A frequent error received while connecting to an Oracle DB from SSIS is “client and networking
components were not found”. The fix for this is to install the Oracle client for the machine you are
running SSIS from. This will install the correct libraries and drivers needed to connect to an Oracle
source.
Visit http://www.oracle.com/technology/software/tech/oci/instantclient/index.html to install the correct
client.
Recommended Reading
Connectivity and SQL Server 2005 Integration Services
http://technet.microsoft.com/en-us/library/bb332055.aspx
How to Choose the Best Connectors for SSIS
http://www.sqlservercentral.com/articles/SQL+Server+2005+-+SSIS/3209/
SSIS Data Sources Categories and Compatibility (highly recommended)
http://ssis.wik.is/Data_Sources
Data Flow Task The Data Flow Task is the most useful and important task. It facilitates the movement of data between
sources to destinations allowing complex transformations to take place within the flow. Eliminating
unnecessary work that the data flow task does is the single most important method to improve its
performance.
Using the highly recommended reference, http://www.simple-talk.com/sql/sql-server-2005/sql-server-
2005-ssis-tuning-the-dataflow-task/, I have categorized the performance improvement tips below. Do
note that I have added or removed some tips based on experience.
General
Increase DefaultBufferMaxSize and DefaultBufferMaxRows - Increasing the values for these
two properties can boost performance by decreasing the number of buffers moving through the
data flow. However, you should avoid increasing the values too much to the point where the
Execution Engine starts swapping out buffers to disk. That would defeat the purpose.
Use match indexes for repeat data cleansing sessions- When the package runs again, the
transformation can either use an existing match index or create a new index. If the reference
table is static, the package can avoid the potentially expensive process of rebuilding the index
for repeat sessions of data cleaning. If you choose to use an existing index, the index is
created the first time that the package runs. If multiple Fuzzy Lookup transformations use the
same reference table, they can all use the same index. To reuse the index, the lookup
operations must be identical; the lookup must use the same columns. You can name the index
and select the connection to the SQL Server database that saves the index.
Implement Parallel Execution - Both the Execution Engine for the Dataflow Task and the
Execution Engine for the Control Flow are multithreaded.
o Use EngineThreads Property – controls the number of worker threads the Execution
Engine will use. The default for this property is 5. However, as you now know, by
simply adding a few components, data flow thread requirements will quickly exceed the
default. Be aware of how many threads the data flow naturally needs and try to keep
the EngineThreads value reasonably close to it.
o Set MaxConcurrentExecutables – if there are multiple Dataflow Tasks in the Control
Flow, say 10 and MaxConcurrentExecutables is set to 4, only four of the Dataflow
Tasks will execute simultaneously. Set, test and measure various value combinations
of this property and the EngineThreads property to determine the optimal setting for
your packages.
Data Flow Sources
Other online links would urge you to use the DataReader Source as opposed to the OLE DB
Source for performance improvements. However, experience suggests that no credible
improvement is seen and one should use the OLE DB Source as the flexibility it provides to
move between environments and servers outweighs the negligible performance degradation.
Remove unnecessary columns - unneeded columns are columns that never get referenced in
the data flow. The Execution Engine emits warnings for unused columns, so they are easy to
identify. This makes the buffer rows narrower. The narrower the row, the more rows that can fit
into one buffer and the more efficiently the rows can be processed.
o NOTE: Buffers are objects with associated chunks of memory that contain the data to
be transformed. As data flows through the Dataflow Task, it lives in a buffer from the
time that the source adapter reads it until the time that the destination adapter writes it.
Binary Large Objects (BLOBs) are particularly burdensome to the Dataflow Task and
should be eliminated if at all possible. Use the queries in the source adapters to
eliminate unnecessary columns.
Use a SQL select statement to retrieve data from a view- avoid using the “Table or view”
access mode in the OLE DB Source Adapter. It is not as fast as using a SELECT statement
because the adapter opens a rowset-based on the table or view. Then it calls OpenRowset in
the validation phase to retrieve column metadata, and later in the execution phase to read out
the data.
o Testing has shown that using a SELECT statement can be at least an order of
magnitude faster, because the adapter issues the specified command directly through
the provider and fetches the data using sp_prepare without executing the command,
avoiding the extra roundtrip and a possibly inappropriate cached query plan.
Optimize source queries - using traditional query optimization techniques optimize the source
adapter SQL query. SSIS doesn't optimize the query on your behalf, but passes it on verbatim.
Set based operations - when possible, perform set based operations in SQL Server. For
example, SQL Server can generally sort faster than the sort transform, especially if the table
being sorted is indexed. Set based operations such as joins, unions, and selects with ORDER
BY and GROUP BY tend to be faster on the server.
If you know that data in a source is sorted, set IsSorted=TRUE on the source adapter output.
This may save unnecessary SORTs later in the pipeline which can be expensive. Setting this
value does not perform a sort operation; it only indicates that the data it sorted.
If you need a dynamic SQL statement in an OLE DB Source component, set
AccessMode="SQL Command from variable" and build the SQL statement in a variable that
has EvaluateAsExpression=TRUE.
If you can, filter your data in the Source Adapter rather than filter the data using a Conditional
Split transform component. This will make your data flow perform quicker.
Break out all tasks requiring the Jet engine (Excel or Access data sources) into their own
packages that do nothing but that data flow task. Load the data into Staging tables if
necessary. This will ensure that solutions can be migrated to 64bit with no rework as there is no
64bit Jet driver.
Flat File and Other File Sources
Retrieving data from file sources presents its own set of performance challenges because the data is
typically in some format that requires conversion. For example, the Jet Provider only supports a limited
set of data types when reading Excel files and flat file data are always of type string until converted.
Here are a few hints on how to eliminate unnecessary data flow work:
Combine Adjacent Unneeded Flat File Columns – to eliminate unnecessary parsing
Leave Unneeded Flat File Columns as Strings – don't convert them to dates etc. unless
absolutely necessary
Eliminate Hidden Operations – mostly the Dataflow Task is explicit about what it is doing.
However, there are some components that perform hidden or automatic conversions. For
example, the Flat File Source Adapter will attempt to convert external column types to their
associated output column types. Use the Advanced Editor to explore each column type so that
you know where such conversions occur.
Only Parse or Convert Columns When Necessary – Reorganize the data flow to eliminate the
Type Conversion Transform if possible. Even better, if possible, modify the source column data
type to match the type needed in the data flow.
Use the FastParse Option in Flat File Source – Fastparse is a set of optimized parsing routines
that replace some of the SSIS locale-specific parsing functions.
Eliminate Unneeded Logging – logging is useful for debugging and troubleshooting but, when
deploying completed packages to production, be mindful and careful about the log entries you
leave enabled and the log provider you use. Notably, OnPipelineRowsSent is somewhat
verbose.
When using raw files and your Raw File Source Component and Raw File Destination
Component are in the same package, configure your Raw File Source and Raw File
Destination to get the name of the raw file from a variable. This will avoid hardcoding the name
of the raw file into the two seperate components and running the risk that one may change and
not the other.
Variables that contain the name of a raw file should be set using an expression. This will allow
you to include the value returned from System::PackageName in the raw file name. This will
allow you to easily identify which package a raw file is to be used by.
o Note: This approach will only work if the Raw File Source Component and Raw File
Destination Component are in the same package.
Data Flow Transformations
Be mindful of transforms with internal File IO - Some of the stock dataflow transforms perform
internal file Input/Output. For example, the Raw Source and Destination, Import/Export Column
Transforms, Flat File Source and Destination and Excel File Source and Destination Adapters
are all directly impacted by the performance of the file system. File IO isn't always a bottleneck,
but when combined with low memory conditions, causing spooling or with other disk intense
operations, it can significantly impact performance. Components that read and write to disk
should be scrutinized carefully, and if possible, configured to read and write to dedicated hard
drives. Look for ways to optimize the performance of the hard drives using RAID,
defragmentation and/or correct partitioning.
Monitor memory intensive transforms - if your package is memory bound, look for ways to
eliminate the memory intensive transforms or shift them to another package. Some transforms
such as the Aggregate, Lookup and Sort use a lot of memory. The Sort, for example, holds all
buffers until the last buffer and then releases the sorted rows. If memory runs low, these
transforms may spool to disk, causing expensive hard page faults.
o Monitor other memory intensive applications - when running on the same machine as
other memory intensive applications, the data flow can become memory starved, even
if there is plenty of memory on the machine. This is typically true when running
packages on the same machine with SQL Server. SQL Server is aggressive about
using memory. You can use the sp_configure system stored procedure instruct SQL
Server to limit its memory usage.
Lookup Transformation
Pare down the Lookup reference data - the default lookup query for the Lookup Transform is
SELECT * FROM TABLE(S). Instead, select the option to use the results of a query for the
reference data. Generally, the reference data should only contain the key and the desired
lookup column. So, for a dimension table lookup, that would be the natural key and the
surrogate key.
Use Lookup Partial or Full Cache mode - depending on the requirements and the data, you
should choose one of these two modes to speed up the Lookup. Partial cache mode is useful
when the incoming data is repetitive and only references a small percentage of the total
reference table. Full cache mode is useful when the reference table is relatively small and the
incoming data references the full spectrum of reference table rows.
LOOKUP components will generally work quicker than MERGE JOIN components where the 2
can be used for the same task (Case study:
http://blogs.conchango.com/jamiethomson/archive/2005/10/21/2289.aspx)
Slowly Changing Dimension (SCD) Transformation
It is interesting that SSIS does provide an out-of-the-box solution for SCDs however for most
practical purposes, the performance is extremely poor.
Some other cons of this transform are:
o It can be slow when applied to large dimensions, i.e. anything over 1,000 rows.
o You can’t configure the implied comparison/lookup between the two tables. It is strictly
a .NET comparison for equality. That means, for example, that a varchar(10) may not
compare correctly to a char(10), even though to the human eye they are identical.
o If the results of any part of the comparison results in a NULL, the transformation throws
an exception, which may or may not cause the entire package to abort. It depends on
how you have “programmed” the package.
o It can’t handle any “special cases”. Like updating Oracle tables.
An alternative approach to using SSIS’s SCD transformation would be to consider using an
open source solution such as the Kimbal Method SSIS SCD Component
(http://www.codeplex.com/kimballscd).
Another alternative approach would be to use existing SSIS components and build a solution
for the SCD types as shown below:
Data Flow Destination
Other online links would urge you to use the DataReader Destination as opposed to the OLE
DB Destination for performance improvements. However, experience suggests that no credible
improvement is seen and one should use the OLE DB Destination as the flexibility it provides to
move between environments and servers outweighs the negligible performance degradation.
Set the Commit Size - The Commit Size option allows you to set a larger buffer commit size for
loading into SQL Server. This setting is only available in the OLEDB Destination when using
the SQL Server OLEDB driver. A setting of zero indicates that the adapter should attempt to
commit all rows in a single batch.
Turn on Table Lock - This option is available in the OLEDB Destination Editor. Selecting "Table
Lock" also enables fast load, which tells the adapter to use the IRowsetFastload bulk insert
interface for loading.
o NOTE: Fastload delivers much better performance, however it does not provide as
much information if there is an error. Generally, for development you should turn it off
and then turn it on when deploying to production.
Disable Constraints - This option is also available in the OLEDB Destination Editor by
unchecking the "Check constraints" option.
Trash Destination
During development, it is quite useful to have a destination that consumes rows but does not
necessarily provide any functionality. There is an adapter developed by Konesans (along with many
other useful adapters), that does exactly that. Using it with a Data Viewer task allows you to monitor
rows that could have been potentially rejected by the previous component in the data flow.
You can download the adapter here: http://www.sqlis.com/post/Trash-Destination-Adapter.aspx
References
Links
http://www.sqlis.com/post/Easy-Package-Configuration.aspx
http://www.sqlis.com/post/Trash-Destination-Adapter.aspx
http://www.simple-talk.com/sql/sql-server-2005/sql-server-2005-ssis-tuning-the-dataflow-task/
http://www.whitworth.org/2009/04/14/todd-mcdermids-blog-an-alternative-to-ssiss-scd-wizard/