This blog shows how to easily Extract Json and XML objects and Arrays into SQDW using COPY INTO
SQLDW is a bit behind other vendors like snowflake when it comes to being able to query Json natively off a data lake as they have native support. To get around this we simply load the Json or Xml into a staging tables first and then we can shred it using OPENJSON.
Note: some vendors send a stream of Json documents, with one object per line with sat a CR/LF after each. We can support this too, but it will need a different RowTerminator.
Extracting Json Files from DataLake
The following sample query reads JSON file into a single row LOB.
COPY INTO [stg].[UsersJson]([Prop_0]) FROM 'https://swatlakedev.dfs.core.windows.net/landing/GenericFiles/json/Users.json' WITH ( FILE_TYPE = 'CSV' ,fieldterminator ='0x0b' ,fieldquote = '0x0b' ,rowterminator = '0x0c' /* Override This if a Json document and not single object */ ,CREDENTIAL=(IDENTITY= 'Managed Identity') )
The trick is we use 0x0b which is a Vertical Tab hopefully not found inside the Json for Field and Quote delimiters and then we use 0x00c which is a Form Feed as terminator so entire file becomes one blob.
Extracting using Bulk Import in ADF
You can also extract using Data Factory (ADF) by using either a CSV dataset or the native Json format (preferred).
using Json Dataset (below)
Using CSV Dataset (below). If using CSV the complexities are:
- Data Factory does not let you use special ASCII characters like 0x0b for RowTerminator, but you can use \u0001 for column delimiter
- While a CSV DataSet supports “No Terminator”, this does not work with the Copy Task. Annoying limitation.
- We need to call the field “Prop_0” if we don’t want to have to do manual schema or mapping work.
We get around this by using characters not found in the Json file like /b (backspace) or “^” but it would be nice if we had better support for using control characters like \u0001 in the Row Terminator.
Dealing with Nested Json in Data Factory
If you have nested Json then you may get the error message “The retrieved type of data JObject is not supported yet. Please either remove the targeted column ‘XXX’ or enable skip incompatible row to skip the issue rows“
You can fix this by clicking the “Map complex values to string” check box.
So how does Prodata do it ?
We use an automation framework so that registering Json objects to bring into staging is just a meta data operation (no actual coding or work in ADF). We then hedge our bets by supporting importing Json via Copy/Polybase or ADF with Copy being the default and others available with a meta data change.