Why do we need sequential identity in data warehousing?
Currently Fabric doesnt support Identity columns. Even if/when it does, I assume that the numbers won’t be sequential as most MPPs such as SqlPools generate different ranges per “node” as per Synapse, resulting in numbers that may be unique but look somewhat random.
https://azurede.com/2021/07/31/comparing-sql-server-and-synapse-analytics-identity-columns/
First up, its “generally” best to give up on having sequential numbers as keys in an MPP. In order to have guaranteed sequential numbers its implicit that we would need to serialise so two processes do not get the same number at the same time, or only ever run one query at a time!
Second up, read the first up. You really shouldn’t be trying to cling onto sequential identity. However, this can come up with migrations or business processes that are insistent on relying on incremental numbers for example.
So if you still want to get sequential identity to work on Fabric and return its value, read on. Otherwise dont event try!
Naïve Methods in Fabric
Say we are inserting a singleton, then you may see examples like this online
DROP TABLE IF EXISTS [dbo].[Lineage]
CREATE TABLE [dbo].[Lineage] (
LineageKey int NOT NULL
, StartDateTime datetime2(2) NULL
, OtherFields varchar(50) null
);
INSERT INTO [dbo].[Lineage] (LineageKey, StartDateTime )
SELECT coalesce(MAX(LineageKey),0) + 1 as LingeageKey,GETDATE()
FROM [dbo].[Lineage];
SELECT MAX(LineageKey) as LineageKey FROM [dbo].[Lineage] order by LineageKey DESC;
Another method may look like this
DECLARE @Lineagekey INT
SELECT @Lineagekey = coalesce(MAX(LineageKey),0) + 1 FROM [dbo].[Lineage];
INSERT INTO [dbo].[Lineage] (LineageKey, StartDateTime )
SELECT @Lineagekey,GETDATE()
SELECT @Lineagekey as LineageKey
Both of these don’t really work when we have concurrent ETL or processes.
On the traditional SqlDbEngine we could fix this by serialisation via transactions, or a locking hint like TABLOCK, or READPAST. However none of these work with Fabric. Even transactions are difficult as Fabric uses snapshot isolation which would allow 2 processes to read the same values at the same time.
Working Method
So, to get a sequential number we need to serialise work and “block” the second or subsequent process until the first has finished. The code below use a table as a semaphore to prevent concurrent access. Surely there is a better way. Do post if you have one!
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRAN
DROP TABLE IF EXISTS tmp.LineageKey
SELECT coalesce(MAX(LineageKey),0) +1 as LineageKey
INTO tmp.LineageKey
FROM [dbo].[Lineage]
INSERT INTO [dbo].[Lineage] (LineageKey, StartDateTime )
SELECT LineageKey, GETDATE()
FROM tmp.LineageKey
SELECT LineageKey FROM tmp.LineageKey
COMMIT TRAN
Conclusions
Obviously this is a nasty workaround to be avoided and we wait to see new features that come out in Fabric. Also, these sort of tables work so much better in row store SMP like SQLDB or SqlDbEngine.
If only we could add a SQLDB into Fabric 😉