This blog explores a universal architecture for EDW automation. I’m, attempting to be technology agnostic, so little mention of specifics here. Albeit those who know me, know I am a SQL centric fan boy with a software engineering and Finance background.
There are many different technologies for building data warehouses. Currently I see three emerging platforms:
- The traditional relational DBEngine using a mixture of ETL tools and in-db TSQL (some people call this ELT)
- The scale out SQL platform be that Synapse Dedicated Pools or another vendor like Amazon Redshift, snow flake. These tend to focus more on ELT, but typically still need say ADF for orchestration, ingest and file processing.
- The more compute centric approach of Databricks, delta lake and other tools.
To compound this we have cloud on on-premise solutions, with IaaS VMs as a kind of halfway house way to mix traditional technology with newer PaaS.
My question is can all these platforms and technologies agree a single universal Architecture for EDW Automation. I say yes they can, and to prove that I have used the same architecture now on 3-4 different platforms: SSIS and DBEngine, ADF and SQLDB and ADF and SQL Pools.
So what is this Universal Architecture ?
Assuming that we are loading a star schema we write a framework that has the following components.
The Fat (Enterprise Features)
Logging, Lineage and Error Handling are examples of what we call “fat” its stuff that I never want to ever see a developer writing, but the framework should handle it invisibly.
Its important to note that this means that developers must NOT have to write a single line of code. If you write a TSQL SProc called “usp_StartLogging” or an ADF pipeline to do logging that developers needs to include in their code, then you do not have a framework, you have an API which developers need to use. That is a design pattern and NOT a framework.
An example of this approach could be you have an master ADF pipeline that does logging and then calls child packages and developers only ever write child packages which they register in a meta database. Never touching the parent which is doing the fat, this could be a good way to wrap the fat.
Anything which is just moving data I call “piping”, but this extends to any trivial task so this could cover more complex scenarios like:
- De-duping data
- surrogate keys
- Loading of star schema using different algorithms for dimensions v facts
- Inferred members
For piping we like to create a “library” of different pipelines which can then be called via a parent with meta data to describe which function is used for which file or table. Note that I am using the word pipeline which infers ADF, but a template could be a SProc generating dynamic TSQL or a databricks notebook or even an Azure Function written in Python. Any language or platform that can be parameterised re-used and called within orchestration.
An example of a technology that is not great for piping is SSIS (assuming not using BIML), this is because SSIS data flows have a very hard coded schema, so re-usability is difficult. Sure you could write a dynamic dot.net custom shape, but then are you really using SSIS…
The end result is that analyst or developers just register meta data and parameters and piping “just works” with zero coding. Not even a low code workbench.
Meat – Custom Logic
This is where we want developers to be working. Typically the transform part of ETL or maybe some complex data processing module creating new facts like price variation, costing, allocations, master data merging into a dimensions or basic mapping work between different schemas and casting.
I don’t mind what technology is used here, but we want to avoid any work which is 100% repetitive and move that up to piping. Example of how to apply this could be
- SQL Views for very basic transforms. we call these single step transforms.
- SProcs for more complex transform. We call these multi step.
- ADF data flows for transforms out of DB.
- Databricks for transforms direct on data/delta lake.
You may find that as your DW automation framework matures you manage to automate more and more of the custom logic. This increases agility as time goes on.
Want to see this in Action ?
I have an open source example framework for Synapse SQL Pools and a video in this blog here
Want to Discuss ?
I like to discuss Data Architecture on twitter as there are so many people with different angles. opinions and experiences. Join the discussion @bob_duffy