Many folks build star schema data warehouses and the supporting ecosystem of semantic models, business intelligence and/or ML analytics. This is a quick note on how to assess where you are with automation, regardless of the tools you are using: SSIS, ADF, SQL, Databricks, Synapse, or non-Microsoft tools.
I recently ran a Twitter thread, and the #1 requested feature for a DW was agility, e.g. being able to quickly get data from source into the star schema, or at least into a curated format in data lake speak. I would add extensibility to that – there is no point in building something quickly if you can't change it quickly!
How do smaller teams achieve agility when larger new teams don't?
How do consultancy/partner houses who develop DWs for a living achieve this? Well, there are two major differences between an experienced team that delivers quick, successful projects and larger companies that hire lots of people but flounder at delivering quickly and efficiently – without which everything they build gains instant “legacy status”:
- They have a good, proven architecture and methodology. Discussion of tools and platforms is a five-minute conversation, as the team has done this 10–100 times before and has T-shirt sizes for it. All team members know the naming conventions, the tools and how to use them, how to code, how requirements are documented, source control, and DevOps, and know what to do without much discussion. Some team members may come and go, but a senior architect and core members will persist and onboard new members.
- They re-use IP between projects to accelerate the process. Let's call this DW automation tooling, although in practice, at early maturity, a lot of this may just be cut and paste from one customer to another. Even that first step gives them a massive boost over the greenfield team.
What scope does DW automation apply to?
One thing to clarify is the SCOPE of DW automation tools. For someone just starting out, the focus is almost always ingest or extract. Examples are:
- How do we quickly copy 100 tables from one SQL database to another?
- Copying CSVs from the data lake to a SQL staging area
- Converting CSVs into views for serverless SQL
- Converting CSVs into PolyBase views in Synapse
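To make the “easy pickings” concrete, even this first level can be metadata-driven rather than hand-written. A minimal Python sketch that stamps out Synapse serverless views over CSV files from a simple file list – the lake URL, file names, and `stg` schema here are hypothetical examples, not anything from a real project:

```python
# Generate one CREATE VIEW per CSV file in a data lake, driven by
# a simple metadata list instead of hand-writing each view.
# The container URL, file names and schema are hypothetical.

FILES = ["customer.csv", "product.csv", "sales.csv"]
LAKE_URL = "https://mylake.dfs.core.windows.net/raw"  # hypothetical lake path

def make_view_sql(file_name: str) -> str:
    """Return a serverless SQL view definition over one CSV file."""
    view_name = file_name.rsplit(".", 1)[0]
    return (
        f"CREATE OR ALTER VIEW stg.{view_name} AS\n"
        f"SELECT *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{LAKE_URL}/{file_name}',\n"
        f"    FORMAT = 'CSV', PARSER_VERSION = '2.0',\n"
        f"    HEADER_ROW = TRUE\n"
        f") AS rows;"
    )

for f in FILES:
    print(make_view_sql(f))
```

The point is less the specific OPENROWSET syntax than the shape: a tiny amount of metadata plus a generator replaces a hundred near-identical hand-written scripts.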
However, while this is very much the best place to start, it is just the “easy pickings” and the start of the journey. The more complex parts are:
- Automating the transform, or parts thereof
- Automating orchestration
- Automating the load into the star schema
- Automating the creation and maintenance of the star schema
- In some cases, automating the semantic model
Eliminating piping and fat?
Anyone who hears me talk about DW architecture will hear me talk about the three components of a data architecture: fat, piping, and business logic.
- Fat is the part of the solution which is pure overhead, and under no circumstances do we want a developer to EVER write it other than the first time. Examples are: auditing, logging, lineage, monitoring, and error handling.
- Piping is data movement which is deterministic and has no bespoke business input. This work is too basic to be paying a senior data engineer to do. Examples are: ingest and extract, SCD handling, merge-loading a dimension, sliding-window loads, de-duping, loading a fact, and surrogation.
- Business logic is the actual coding of requirements. For my projects this is usually a Bus Matrix style document with ETL notes to help with coding and mappings. Examples could be data mappings, calculations, and complex transformations.
Clearly we want a framework which has standard fat, automates the piping, and allows the data engineers to plug in business logic (without writing any fat).
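One way to picture this separation: the fat is written once and wrapped around every step, so engineers only ever plug in business logic. A minimal Python sketch, assuming a hypothetical `apply_margin` step; real fat would also cover auditing and lineage:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dw")

def pipeline_step(func):
    """Standard 'fat': logging, timing and error handling wrapped
    around every step, so engineers never write it themselves."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("starting %s", func.__name__)
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            log.exception("step %s failed", func.__name__)
            raise
        log.info("finished %s in %.3fs", func.__name__,
                 time.perf_counter() - start)
        return result
    return wrapper

# The engineer writes ONLY the business logic (hypothetical example):
@pipeline_step
def apply_margin(rows):
    return [{**r, "margin": r["sales"] - r["cost"]} for r in rows]
```

The decorator here stands in for whatever the framework provides – the design choice that matters is that the fat lives in one place and the business logic function contains nothing else.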
Some commercial tools attempt to use low code or workbenches to fully automate the business logic as well. I'm not so sure about that myself, as they end up becoming a development environment rather than enhancing one, and can fall behind the traditional development environment – be that ADF or, say, Databricks.
What do these DWA frameworks look like?
A DWA framework can take many forms:
- Simply a methodology based on cut and paste and project experience
- A set of templates, like Jupyter notebooks
- Code generation from templates, like BIML
- Metadata-based with parameterisation, like ADF
- Metadata-based with dynamic code generation, like dynamic T-SQL
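The last form is worth a sketch: from a small amount of metadata, the piping code itself is generated. A hypothetical Python generator that emits a T-SQL MERGE for a type 1 dimension load – the table and column names are made up for illustration:

```python
# Sketch: generate a T-SQL MERGE for a type 1 dimension load from
# metadata. All table and column names here are hypothetical.

def make_merge_sql(dim: str, business_key: str, attributes: list) -> str:
    """Build a MERGE from staging into a dimension, keyed on the
    business key, updating the listed attribute columns."""
    set_clause = ", ".join(f"tgt.{c} = src.{c}" for c in attributes)
    cols = ", ".join([business_key] + attributes)
    src_cols = ", ".join(f"src.{c}" for c in [business_key] + attributes)
    return (
        f"MERGE dbo.{dim} AS tgt\n"
        f"USING stg.{dim} AS src\n"
        f"  ON tgt.{business_key} = src.{business_key}\n"
        f"WHEN MATCHED THEN UPDATE SET {set_clause}\n"
        f"WHEN NOT MATCHED THEN\n"
        f"  INSERT ({cols}) VALUES ({src_cols});"
    )

print(make_merge_sql("DimCustomer", "CustomerCode", ["Name", "City"]))
```

Describe a dimension once in metadata and the same generator loads all of them – that is the piping being automated rather than hand-coded per table.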
As examples:
- Many houses have this as “loose IP” which they use, but it's more a service than a product, and not something customers buy or use without consultancy. I see this in my own company, but also in others like Purple Frog, Advancing Analytics, and probably a long list more!
- Some have semi-productised or open-sourced it, with more rigorous abstraction away from specific customers. An example here could be SDU Tools from Greg Low.
- Some go full on and attempt to release it as a product, like TimeXtender or WhereScape.
In the next blog I will look at a maturity model for measuring where you are with DW automation versus where you could be.
It's not necessarily a good move to climb the maturity levels, e.g. moving from loose IP to a full commercial product could be the kiss of death for a consultancy practice, IMO. Services and products are difficult to mix (or they have been for me!).