Update On The Mass Street Analytics Data Warehouse ETL Framework

Bob Wakefield
Data Driven Perspectives
6 min read · Feb 16, 2022


Several months ago, I released version 1.0 of the Mass Street Analytics Data Warehouse ETL Framework. This is a framework for moving data into and out of an enterprise data warehouse (EDW). It includes everything you need to set up a self-service data analytics platform.

A fundamental principle behind the framework is that you should learn to manually code data warehouse processes. It’s an unpopular opinion, but it’s an opinion formed from 20 years of working in environments where I don’t always have access to the latest and greatest toys.

The framework is more than just an opinionated approach to managing data. It’s a training manual that can take you from “zero to hero”. Even though I hate that phrase, that’s exactly what it does.

While the reception to my work was overwhelmingly positive, there was a small amount of criticism that I ignored at the time.

Additionally, in the intervening time, I have had readers ask me some questions, and I think the wider community would benefit from the answers I gave. So, I wanted to take this time to address the criticism head-on and share some things that I myself have discovered about my own work.

Can I use the framework in the cloud?

Yes and no.

The framework is currently designed for a medium data use case. You can use the framework for relational databases hosted in the cloud like AWS RDS, Azure SQL, etc.
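To make that concrete, here’s a minimal sketch under the framework’s Python and T-SQL approach: pointing a loading script at a cloud-hosted SQL Server instance is really just a connection string change. The server, database, credentials, and stored procedure names below are hypothetical, not anything from the framework itself.

    # A minimal sketch: running a T-SQL load against a cloud-hosted
    # database. Only the connection string differs from on-prem.
    # All names here are hypothetical.
    import pyodbc

    conn_str = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=myserver.database.windows.net;"  # e.g., an Azure SQL endpoint
        "DATABASE=EDW;"
        "UID=etl_user;PWD=<password>;"
        "Encrypt=yes;"
    )

    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    cursor.execute("EXEC dbo.usp_LoadCustomerDim;")  # same T-SQL you'd run on-prem
    conn.commit()
    conn.close()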

However, it is currently not designed for the modern data stack. It will be in upcoming versions, but for now, you can’t use it with any of the cutting-edge architectures like data lake, data mesh, data fabric, etc. That’s a different use case.

That said, there are WAY more orgs keeping it old school than there are pushing the boundaries of data architecture. In the real world, you’re going to find that chaos and prehistoric tech stacks reign supreme. Not only are many companies NOT in the cloud, but they also don’t even have that much data. I’d go so far as to say that’s going to be true for MOST companies.

Additionally, I believe that you should start with the fundamentals and work your way up. Just learning the new hotness without being flexible enough to implement different architectures is extremely limiting. So, start with the old school and work your knowledge forward to present. No matter what you see on social media, the old school is still relevant.

This person’s question was a big revelation for me because I normally work with medium data. I hadn’t really considered how to do this with petabyte-scale data, or with complex environments where, despite a relatively small amount of data, a different architecture is still called for due to reasons other than technical challenges. An example would be a large amount of dysfunction in an organization making centralized data an impossibility.

Now for the criticisms.

ETL is dead. The good stuff now is TEL or ELT.

In my opinion, you can rearrange those letters all you want. At the end of the day, the job is: data is over there, and it needs to be over here. Anything else is semantics. Just because you play a shell game with the letters ETL doesn’t mean you get to throw out things like developing efficient processes, both code-wise and policy-wise.
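To put a finer point on it, here’s a toy sketch of that shell game in Python: the same three steps, just run in a different order. The extract, transform, and load functions are hypothetical stand-ins for whatever your pipeline actually does, not anything from the framework itself.

    # A toy sketch of the "shell game": the same three steps, rearranged.
    # extract(), transform(), and load() are hypothetical stand-ins.

    def extract(source):
        # Pull rows out of the source system.
        return [{"id": 1, "amount": "42.00"}]

    def transform(rows):
        # Clean and conform the data.
        return [{**r, "amount": float(r["amount"])} for r in rows]

    def load(rows, target):
        # Write rows to the target.
        print(f"loaded {len(rows)} rows into {target}")

    # ETL: transform in flight, then land it.
    load(transform(extract("crm")), "edw.fact_sales")

    # ELT: land it raw, then transform inside the warehouse.
    raw = extract("crm")
    load(raw, "staging.sales_raw")
    load(transform(raw), "edw.fact_sales")

Either way, you still have to engineer the extract, the transform, and the load. The order you run them in doesn’t change the discipline required.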

You don’t need any of that with stuff like [insert no code low code tool here].

Avoiding developing data pipe with third-party software is EXACTLY why I developed the framework in the first bloody place. I explore this deeper in my article The Benefits Of Using Python And T-SQL Over SSIS For ETL.

Suffice it to say, the future isn’t No Code.

The future is Whoa! Code!

All the stuff in the document is deprecated.

First of all, no.

Because what’s in the document? Policies, tools, procedures, and sample code surrounding master data management, data ops, dev ops, developing data pipe with agile methodology, statistical process control, and that’s just for starters.

Does any of that sound deprecated to you?

As a matter of fact, data ops is still pretty obscure and cutting edge from my perspective. From a lot of other people’s perspectives, too.
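To give you a taste of the statistical process control piece, here’s a minimal sketch of the idea applied to a data load: compute 3-sigma control limits from recent row counts and flag a run that falls outside them. The numbers and the alerting behavior are made up for illustration.

    # A minimal SPC sketch: flag a load whose row count falls outside
    # 3-sigma control limits computed from recent history.
    from statistics import mean, stdev

    history = [10120, 9987, 10205, 10034, 9890, 10110, 10078]  # last 7 loads
    todays_count = 4212  # suspiciously low

    center = mean(history)
    sigma = stdev(history)
    ucl = center + 3 * sigma  # upper control limit
    lcl = center - 3 * sigma  # lower control limit

    if not (lcl <= todays_count <= ucl):
        # In a real pipeline, you'd halt the load and alert someone.
        print(f"Out of control: {todays_count} outside [{lcl:.0f}, {ucl:.0f}]")

No exotic tooling required, and it catches silent failures before they reach your users.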

Look into BIML.

Another fundamental concept of the framework is simplicity: ease of use and ease of maintenance. BIML is 100% NONE of those things. In fact, it’s the total opposite, and it adds more complexity to the situation than was there when you were just using SSIS.

I’m a purist. I believe you should write things with code, and that code should be as simple as possible but no simpler than that.

That said, if there is one framework besides my own that I’d like to recommend, it would be dbt. I haven’t dived into it yet but, from what I’ve seen, it aligns with my principles. The only difference is I don’t assume a data lake architecture everywhere I go.

Also, my framework is for developing end-to-end solutions. dbt is more focused on what is called analytics engineering, which is really nothing more than spicy business intelligence, working on the delivery side of data.

I personally think that a data engineer is by their very nature a “full stack” resource that can work front, middle, back, and side to side. I don’t see focusing on one aspect of the data pipeline as viable for your career long term.

dbt is SUPER popular but I think that’s kind of smoke and mirrors. dbt Labs is VERY good at making fetch happen. But keep in mind what I said earlier about most companies keeping it old school.

What you see in the Zeitgeist and the media, social and otherwise, isn’t representative of reality. So, to quote @laurieontech, “…if some technology seems like it’s taking over…it definitely isn’t”.

After over 20 years in tech, I can confirm that hype ALWAYS goes away while the fundamentals of good software engineering remain.

The Future Of The Framework

The framework is ever evolving as I learn more and technology advances. For sure, I’m going to update the framework to work with the modern data stack. That’s a no brainer. However, I won’t be deprecating older concepts.

I’ll be teaching you to use the old and the new in conjunction because the old and the new exist simultaneously in the real world. For every data lake out there, there are like 100 mainframes running COBOL that someone wrote in the ‘70s.

On my to-do list is also figuring out a comprehensive master data management solution. MDM is hard and scary, and most software packages cost an arm, a leg, your firstborn, and a blood sacrifice. On top of which, these tools tend to be gold-plated and filled with unnecessary features. I intend to create a free, simple, open-source MDM solution that anybody can use.

The MOST exciting thing, however, is EDW Catapult (documentation isn’t that awesome yet).

For YEARS I’ve been saying, “The ETL Framework is not a piece of software with an executable that can be downloaded and installed.” This is totally false!

In an EPIC facepalm moment, I realized that the physical implementation of the framework is really just a bunch of scripts and script templates. That could TOTALLY be implemented as an application. What was I even smoking?!

So EDW Catapult is going to be a CLI in the beginning, but eventually a full-blown app with a GUI that will let you implement the framework with a minimum of reading about how to use it.
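Since the documentation isn’t there yet, treat this as a purely hypothetical sketch of what a script-template CLI could look like. None of these commands or template names come from EDW Catapult itself; they just illustrate the idea of turning script templates into a command-line tool.

    # A hypothetical sketch of a script-template CLI. Nothing here is
    # actual EDW Catapult code.
    import argparse
    from string import Template

    STAGE_TABLE = Template(
        "CREATE TABLE staging.$table (\n"
        "    LoadDate DATETIME2 DEFAULT SYSUTCDATETIME(),\n"
        "    SourceSystem NVARCHAR(50)\n"
        ");"
    )

    def main():
        parser = argparse.ArgumentParser(description="Render EDW scaffolding scripts.")
        parser.add_argument("table", help="name of the staging table to generate")
        args = parser.parse_args()
        print(STAGE_TABLE.substitute(table=args.table))

    if __name__ == "__main__":
        main()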

I intend to keep as much stuff free and open source as possible, so I’ll definitely keep the community updated!

--

Living at the intersection between finance, economics, and data science/engineering. Follow me on Twitter! @BobLovesData