How To Optimize Your ETL Environment To Get The Most Out Of SSIS

Bob Wakefield
Data Driven Perspectives
5 min read · Oct 13, 2020


I recently wrote an article titled The Benefits Of Using Python And T-SQL Over SSIS For ETL. It is turning out to be my most popular article ever.

I am an SSIS expert. The number of man hours I have logged on that tool can be measured in years. I know what it can do and what it cannot do. Most importantly, I know what you should not be doing with it.

SSIS has numerous issues that I laboriously point out in my previous article. However, sometimes telling people about the horrors of SSIS can be a bit futile. Sometimes it is irrelevant because the horse has already left the barn. The organization has used SSIS for years and has hundreds of packages running their ETL environment. As a consultant, you just have to adapt and overcome.

If you are going to use SSIS, or have no choice in the matter, there are some things you can do to make it easier to work with. If you are a nontechnical manager, trust me: your engineers will thank you for this.

The goal here is to set up your environment in such a way that nothing has to change in SQL code or your packages as you promote them from dev to test to prod.

Demand That Your Data Engineers Have A High Level Of SQL Skill

Ninety percent of what the data flow task does in SSIS can be done in T-SQL, and usually much more efficiently.

The point-and-click nature of SSIS is deceptive. As I discussed in the previous article, it is not true that developing ETL with SSIS is faster or easier than T-SQL, and it is certainly not easier to manage the code base.

My SSIS packages are really simple. That is the goal, and it is not something to be ashamed of. Open the data flow tab on one of my packages and all you will see is a source and a target component on the design surface. That is because I make SQL Server do all the work of joining and transforming data. The SQL Server relational engine is built to do just that.

If you insist that your junior engineers rely on stored procedures instead of data flow tasks, you will wind up with a code base that is far easier to maintain and SSIS packages that are simpler and far easier to put into production.
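Here is a minimal sketch of that pattern. All object names (Staging.Customer, dbo.DimCustomer, dbo.Region) are hypothetical; the point is that the join and the cleanup happen inside the engine, and the package just calls the procedure.

CREATE OR ALTER PROCEDURE dbo.LoadDimCustomer
AS
BEGIN
    SET NOCOUNT ON;

    -- Join and transform inside the relational engine,
    -- not in an SSIS data flow.
    INSERT INTO dbo.DimCustomer (CustomerKey, CustomerName, Region)
    SELECT s.CustomerID,
           UPPER(LTRIM(RTRIM(s.CustomerName))),
           r.RegionName
    FROM Staging.Customer AS s
    INNER JOIN dbo.Region AS r
        ON r.RegionCode = s.RegionCode
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.DimCustomer AS d
                      WHERE d.CustomerKey = s.CustomerID);
END;

The package then needs nothing more than an Execute SQL Task running EXEC dbo.LoadDimCustomer; which is trivial to promote from environment to environment.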

If your engineers need to level up, they can take my free T-SQL tutorial Practical T-SQL Pocket Guide For Beginners. Despite the name, people with a few years of data engineering experience under their belt can still learn from it, because it teaches a T-SQL-first philosophy in a data warehouse ETL context.

Be Consistent With Your Database Names

A database's name should NOT change based on the server it lives on. There should not be an EDW_DEV, an EDW_TEST, and an EDW_PROD. You laugh, but this is the kind of thing I have seen data professionals with years of experience do.

The only thing that should be different from box to box is the name of the server.

Once you create a server alias (covered below), you can create connections in SSIS that do not have to change as you promote the package from test to prod.
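As an illustration, suppose the warehouse database and the alias are both named EDW (hypothetical names). An OLE DB connection string along these lines then works unchanged on every box, because the alias resolves to the right server locally:

Data Source=EDW;Initial Catalog=EDW;Provider=SQLNCLI11.1;Integrated Security=SSPI;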

Create A Standard File I/O Directory Structure

When you have to process import files, or send output files back to users or third parties, use a standard directory structure. The structure looks like this:

[Drive letter]:\InterfaceAndExtractFiles\[Business Unit]\[ETL Process Name]\[In | Out]

Start by creating the root InterfaceAndExtractFiles directory. You can name the root folder whatever makes sense for your organization. The key is that this is the landing pad for all incoming and outgoing files before they hit your edge servers and go out to the wider world.

ETL process names should not be cryptic; they should be as descriptive as possible. When you create a new project in SSDT, the project name should match the name of the process directory.

Inside each process directory should be an In folder and an Out folder. The In folder is where you import files from other processes. The Out folder holds the results of any ETL processing whose output is a flat file. You can create whatever file structure is necessary to facilitate your process inside the In/Out folders, as long as the base structure mentioned above exists.
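Creating the tree can be scripted from T-SQL. A minimal sketch, assuming a D: drive and a hypothetical Finance business unit with a process named CustomerExtract (xp_create_subdir is the extended procedure maintenance plans use; it creates nested folders and requires appropriate server permissions):

EXEC master.dbo.xp_create_subdir N'D:\InterfaceAndExtractFiles\Finance\CustomerExtract\In';
EXEC master.dbo.xp_create_subdir N'D:\InterfaceAndExtractFiles\Finance\CustomerExtract\Out';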

Create A Global Environment In The Integration Services Catalog

Packages are deployed to the SSIS catalog using the project deployment model. Some variables are common across all packages; store these in an environment named Global. The name/value pairs of those variables are as follows. The first pair is the public folder on your edge server where files are exchanged with internal data customers and third parties. The second is the landing pad that is only accessible to warehouse ETL processes.

Variable name: FileExchangeRootDirectory

Value: \\[Server Name]\FileExchange\

Variable name: RootFileDirectory

Value: [Drive Letter]:\InterfaceAndExtractFiles\
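You can create the environment and its variables with the SSISDB catalog stored procedures rather than clicking through SSMS. A sketch, assuming a catalog folder named ETL and placeholder server and drive names:

DECLARE @folder sysname = N'ETL';

EXEC SSISDB.catalog.create_environment
     @folder_name = @folder,
     @environment_name = N'Global';

-- Public share on the edge server (hypothetical server name).
EXEC SSISDB.catalog.create_environment_variable
     @folder_name = @folder,
     @environment_name = N'Global',
     @variable_name = N'FileExchangeRootDirectory',
     @data_type = N'String',
     @sensitive = 0,
     @value = N'\\EDGE01\FileExchange\',
     @description = N'Public folder where files are exchanged.';

-- Landing pad reachable only by warehouse ETL processes.
EXEC SSISDB.catalog.create_environment_variable
     @folder_name = @folder,
     @environment_name = N'Global',
     @variable_name = N'RootFileDirectory',
     @data_type = N'String',
     @sensitive = 0,
     @value = N'D:\InterfaceAndExtractFiles\',
     @description = N'Landing pad for warehouse file I/O.';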

Create SQL Server Aliases

Creating server aliases lets you give the same name to different database servers. Your packages reference the alias, and the alias on each box points them to the proper server, without your ever having to change the database connection settings.

Create an alias (in SQL Server Configuration Manager, under the SQL Native Client configuration nodes) with the following settings:

Name: [Data Warehouse Name]

IP: [Server Name]

These settings assume that server names are relatively set in stone. If server names change frequently, use actual IP addresses instead. Create both 32-bit and 64-bit aliases, and make sure TCP/IP is enabled.
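The alias itself lives in the client network configuration, so there is no T-SQL to create it, but you can sanity-check it once it exists. Assuming a hypothetical alias named EDW, connect through the alias and confirm which physical server answered:

-- Run this while connected as Server=EDW.
SELECT @@SERVERNAME AS PhysicalServer;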

ETL Framework

The steps I just outlined are a subset of the broader Mass Street Analytics ETL Framework.

I have worked in organizations without a formal framework for developing ETL, and it was a nightmare. Having a standard method of doing everything in place at your organization is a huge time saver, and time is money. Implementing even a subset of the ETL Framework should result in a significant reduction of hours charged to the budget for maintenance and deployment of SSIS packages.


Living at the intersection between finance, economics, and data science/engineering. Follow me on Twitter! @BobLovesData