Fabric Wish List for DWA

As you may have seen from some of my blog posts, videos or talks, we put a lot of thought and time into Data Warehouse Automation Frameworks (DWA) and recently completed our V1 of Fabric DWA.
https://prodata.ie/fabric-dwa/

During the thousands of hours of R&D we came across a lot of things that frustrated us, or that could have helped us make a better solution, so I’m going to summarise the top items here.

If anyone wants to discuss, feel free to reach out to me. Always happy to chat on Data Warehousing and DWA.

If you agree, please vote on them!

1. Dynamic Pipeline/Notebook Execution

SSIS back in 2005 included the ability to run a sub package by using a variable/expression, but in 2024 we don't have that on the latest platform! This is invaluable for any metadata framework as it lets you develop and add content without having to also update code or control flow.
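To show why this matters for a metadata framework, here is a minimal sketch in plain Python. It is not Fabric code: `run_from_metadata`, the task rows and `fake_runner` are all hypothetical names made up for illustration, and the runner stands in for whatever would actually invoke a pipeline or notebook by name.

```python
from typing import Callable, Dict, List


def run_from_metadata(
    tasks: List[Dict[str, str]], runner: Callable[[str], str]
) -> Dict[str, str]:
    """Run every enabled task; the name to run comes from metadata, not code."""
    results: Dict[str, str] = {}
    for task in tasks:
        # Skip rows that are switched off in the metadata.
        if task.get("enabled", "Y") != "Y":
            continue
        results[task["name"]] = runner(task["name"])
    return results


# Hypothetical metadata rows, as they might be read from a control table.
TASKS = [
    {"name": "load_customer", "enabled": "Y"},
    {"name": "load_product", "enabled": "N"},
    {"name": "load_sales", "enabled": "Y"},
]


def fake_runner(name: str) -> str:
    """Stand-in for a real 'invoke by name' call (an assumption, not an API)."""
    return f"ran {name}"


RESULTS = run_from_metadata(TASKS, fake_runner)
print(RESULTS)
```

Adding a new load is then just a new metadata row; the control flow itself never changes. That is only possible when the invoked name can be an expression rather than a fixed reference.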

Let us set the invoked pipeline or notebook name as a variable. It had an ETA of H2 2023 but never materialised ;-(
https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=2b86b5da-bb24-ee11-a81c-6045bdb6ba8b

2. On-Premise Data Gateway for Fabric

Currently data pipelines and notebooks with Fabric cannot access data that is on-premise and behind a VNet. This means we have to use the legacy ADF or, even worse, a dataflow to get data from on-premise.

Planned Q1 2024
https://learn.microsoft.com/en-us/fabric/release-plan/data-factory#opdg

3. Identity Data Type

For some customers who have 1,000s of tables with Identity columns on dimensions, this alone is a show stopper for a Fabric migration. Sure, there are workarounds for new projects such as hashing or manually trying to use row_number(), but these hacks all have limitations, like clashes or not supporting concurrency greater than 1.
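The concurrency limitation of the row_number() style workaround can be sketched in a few lines of plain Python. This is an illustration only, assuming keys are assigned as the current maximum key plus a row number; `assign_keys` and the sample customer codes are hypothetical.

```python
from typing import Dict, List


def assign_keys(existing: Dict[str, int], new_business_keys: List[str]) -> Dict[str, int]:
    """Assign surrogate keys as max(existing key) + row number."""
    start = max(existing.values(), default=0)
    # Only business keys not already in the dimension get a new surrogate key.
    fresh = [bk for bk in new_business_keys if bk not in existing]
    return {bk: start + i for i, bk in enumerate(fresh, start=1)}


# Hypothetical dimension: business key -> surrogate key.
DIM = {"CUST-A": 1, "CUST-B": 2}

# Two loaders read the same snapshot of the dimension before either one writes.
BATCH_1 = assign_keys(DIM, ["CUST-C", "CUST-D"])
BATCH_2 = assign_keys(DIM, ["CUST-E"])

# Both loaders started from the same maximum, so their keys overlap.
CLASHES = set(BATCH_1.values()) & set(BATCH_2.values())
print(BATCH_1, BATCH_2, CLASHES)
```

Both batches start numbering from the same maximum, so they hand out the same key to different rows unless the loads are forced to run one at a time. A real Identity column lets the engine own the key assignment and avoids this.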

Currently no public commitment ;-(
https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=43d94c9b-aa16-ee11-a81c-000d3a047196

4. Fabric SQLDB Integration

It's great that Fabric gives us many ways to write delta-parquet files, but for metadata and logging we really need a different platform: one that has super low latency and is optimised for small non-analytical queries.

Ideally, what we need here is to be able to create a Fabric “SQLDB” and link it into a Warehouse or Lakehouse to contain metadata.

https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=ef2bed11-d9e6-ee11-a73e-002248525fd9

5. Managed Identity

Currently we cannot authenticate from Fabric to anything else in Azure using Managed Identity. Notebooks with PySpark run under the identity of the last user who saved them, which doesn't sound very secure, and often we need to use a service principal to call, say, the Graph API.

We want the Fabric Workspace to have a Managed Identity, so that it can be granted access rights within Azure and notebooks can run under this identity.

https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=650a8e47-8ad2-ee11-92bd-6045bdbdea69

6. Custom DNS for SQL Endpoints

We currently have to connect to LH and DW artefacts by using a very cumbersome DNS name like

kkm4vwf6l6zebg4lqrhbtdcmsq-eyv2nzeycu4ulpgoqid75tjkmy.datawarehouse.fabric.microsoft.com

This causes two issues:

  1. We want to use a logical DNS name so we can migrate workspaces and keep the same connection string for clients.
  2. We need to have a quick way to know if a server/workspace is development or production. This helps prevent accidental deployments and mistakes like accidentally deleting production data thinking it was dev or test.

https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=c21f8b1f-137e-ee11-a81c-00224854aa33

7. Shared Spark Cluster

When a Spark session is “cold” it could take 3 minutes to spin up, and if it's a starter pool it can now take 6 seconds.
Many ETL processes have numerous different notebooks with a small amount of Python code per notebook, so even these 6 seconds can add a huge amount of time.

A shared persistent Spark cluster for the entire run would eliminate the need for users to wait for the startup time of Spark, and would also prevent excessive CU consumption caused by launching multiple clusters.
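As a rough back-of-the-envelope sketch of how the startup times above add up, assume a hypothetical run of 200 notebooks executed one after another, each starting its own session. The 6 second and 3 minute figures come from the text; the notebook count and the function name are assumptions.

```python
def startup_overhead_seconds(
    notebook_count: int, startup_seconds: float, shared_session: bool = False
) -> float:
    """Total time spent waiting for Spark sessions to start in one run."""
    # With a shared session the startup cost is paid once, otherwise per notebook.
    sessions = 1 if shared_session else notebook_count
    return sessions * startup_seconds


NOTEBOOKS = 200          # hypothetical number of notebooks in an ETL run
STARTER_POOL = 6.0       # seconds, starter pool figure from the text
COLD_SESSION = 180.0     # seconds, "cold" figure from the text

SEPARATE_STARTER = startup_overhead_seconds(NOTEBOOKS, STARTER_POOL)
SEPARATE_COLD = startup_overhead_seconds(NOTEBOOKS, COLD_SESSION)
SHARED_STARTER = startup_overhead_seconds(NOTEBOOKS, STARTER_POOL, shared_session=True)

print(SEPARATE_STARTER / 60, "minutes of startup on starter pools")
print(SEPARATE_COLD / 3600, "hours of startup on cold sessions")
print(SHARED_STARTER, "seconds of startup with one shared session")
```

Under these assumptions the starter pool case costs 20 minutes of pure waiting per run, and the cold case 10 hours, against a single startup when one session is shared. Parallel execution would shorten the wall-clock time but not the number of sessions launched.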

https://ideas.fabric.microsoft.com/ideas/idea/?ideaid=f8742aa3-3696-ee11-a81c-000d3a7c2745
