Data Lake Storage to Synapse Analytics using Managed Service Identity (MSI) – COPY INTO and PolyBase

Why use MSI?

  • Works with any service that supports Azure Active Directory authentication
  • Keeps credentials out of code and configuration
  • More secure than SQL authentication and less susceptible to credential theft

Configuration steps

  1. Create a storage account and enable the hierarchical namespace for Data Lake Storage Gen2.
  2. Within the storage account, create a container for the file system.
  3. Under Access Control (IAM), assign at least the Storage Blob Data Contributor role to the user or to the Synapse workspace's managed identity.
  4. Log in to the dedicated SQL pool through SSMS to get started with data loading using COPY INTO and PolyBase.

Prerequisites

  • Log in to the dedicated SQL pool through SSMS
  • Create a SQL table in the database (e.g. stg.Transactions)
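A minimal staging table matching the columns used later in this post might look like the sketch below; the distribution and index choices are assumptions, so adjust them for your workload.

```sql
-- Hypothetical DDL for the staging table; column types mirror the
-- external table definition used later in this walkthrough.
CREATE TABLE [stg].[Transactions]
(
    [FinanceKey]         [int]      NOT NULL,
    [DateKey]            [int]      NOT NULL,
    [OrganizationKey]    [int]      NOT NULL,
    [DepartmentGroupKey] [int]      NOT NULL,
    [ScenarioKey]        [int]      NOT NULL,
    [AccountKey]         [int]      NOT NULL,
    [Amount]             [float]    NOT NULL,
    [Date]               [datetime] NOT NULL
)
WITH (
    DISTRIBUTION = ROUND_ROBIN,   -- assumption: round robin is fine for a staging table
    CLUSTERED COLUMNSTORE INDEX
);
```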

How to use COPY INTO

COPY INTO [stg].[Transactions] (FinanceKey, DateKey, OrganizationKey, DepartmentGroupKey, ScenarioKey, AccountKey, Amount, Date)
FROM 'https://demodatalake.dfs.core.windows.net/democontainer/transactions/transactions.csv'
WITH (
	FILE_TYPE = 'CSV'
	,ROWTERMINATOR = '0x0A'
	,FIRSTROW = 2
	,FIELDTERMINATOR = ','
	,CREDENTIAL = (IDENTITY = 'Managed Identity')
)

FILE_TYPE specifies the format of the external data, such as CSV, PARQUET, or ORC. Authentication can also be done using other methods such as Shared Access Signature (SAS), service principals, storage account keys, and Azure Active Directory. For the full list of COPY INTO parameters, see the Microsoft documentation.
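For example, a SAS-authenticated variant of the same load swaps only the CREDENTIAL clause; everything else stays the same (the token value below is a placeholder).

```sql
-- Same load as above, but authenticated with a Shared Access Signature
-- instead of the managed identity. The SAS token is a placeholder.
COPY INTO [stg].[Transactions] (FinanceKey, DateKey, OrganizationKey, DepartmentGroupKey, ScenarioKey, AccountKey, Amount, Date)
FROM 'https://demodatalake.dfs.core.windows.net/democontainer/transactions/transactions.csv'
WITH (
	FILE_TYPE = 'CSV'
	,ROWTERMINATOR = '0x0A'
	,FIRSTROW = 2
	,FIELDTERMINATOR = ','
	,CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<your-SAS-token>')
);
```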

How to use PolyBase

PolyBase serves the same purpose as COPY INTO, but it requires several setup steps before the actual data load.

  1. Create a database scoped credential (other authentication methods can be used, such as SAS, account key, etc.)
CREATE DATABASE SCOPED CREDENTIAL demoCred
WITH IDENTITY = 'Managed Identity'
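Note that a database scoped credential can only be created once the database has a master key. If the dedicated SQL pool doesn't have one yet, create it first (the password below is a placeholder):

```sql
-- Required once per database before any scoped credential can be created.
-- Replace the password with a strong value of your own.
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<StrongPassword123!>';
```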

  2. Create an external data source using the scoped credential; make sure to use the 'abfss' scheme for managed identity.

CREATE EXTERNAL DATA SOURCE demoExtDS
WITH (
	TYPE = HADOOP,
	LOCATION = 'abfss://democontainer@demodls.dfs.core.windows.net',
	CREDENTIAL = demoCred
)

  3. Create an external file format that describes the format of the source files, such as delimited text (CSV), as shown below.

CREATE EXTERNAL FILE FORMAT [scsvFile] 
WITH (
	FORMAT_TYPE = DELIMITEDTEXT,
	FORMAT_OPTIONS (
		FIELD_TERMINATOR = N',',
		STRING_DELIMITER = N'"',
		FIRST_ROW=2,
		USE_TYPE_DEFAULT=False
	)
)

  4. Create an external table that references the data stored in Data Lake Storage.

CREATE EXTERNAL TABLE [ext].[Transactions_EXT]
(
	[FinanceKey] [int] NOT NULL
   ,[DateKey] [int] NOT NULL
   ,[OrganizationKey] [int] NOT NULL
   ,[DepartmentGroupKey] [int] NOT NULL
   ,[ScenarioKey] [int] NOT NULL
   ,[AccountKey] [int] NOT NULL
   ,[Amount] [float] NOT NULL
   ,[Date] [datetime] NOT NULL
)
WITH (
	DATA_SOURCE = [demoExtDS],
	LOCATION = N'transactions/transactions.csv',
	FILE_FORMAT = scsvFile,
	REJECT_TYPE = VALUE,
	REJECT_VALUE = 0
)
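At this point it's worth a quick sanity check that the external table can actually reach the file and parse its columns; this hypothetical query is not part of the original walkthrough but catches path and format mistakes early.

```sql
-- Verify the external table resolves the file and parses the columns.
SELECT TOP 10 * FROM [ext].[Transactions_EXT];
```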

  5. Run SQL statements to load data from the external table into the dedicated SQL pool.

INSERT INTO stg.Transactions (FinanceKey, DateKey, OrganizationKey, DepartmentGroupKey, ScenarioKey, AccountKey, Amount, Date)
SELECT *
FROM ext.Transactions_EXT
OPTION (LABEL = 'TEST-74E34C3D')
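Alternatively, a common PolyBase pattern is to load via CREATE TABLE AS SELECT (CTAS), which is minimally logged in dedicated SQL pools. A sketch, where the target table name and distribution are assumptions:

```sql
-- Hypothetical CTAS load; pick a distribution that suits your queries.
CREATE TABLE [stg].[Transactions_CTAS]
WITH (
	DISTRIBUTION = ROUND_ROBIN,
	CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT *
FROM ext.Transactions_EXT
OPTION (LABEL = 'TEST-74E34C3D-CTAS');
```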

If you are feeling fancy and don't want to perform all the configuration steps manually, let PowerShell do everything for you in one go.

PowerShell Script to generate data lake and assign permissions

Import-Module -Name Az -Force

$username = "[AzureUsername]"
$password = "[Password]" 
$secPwd = ConvertTo-SecureString $password -AsPlainText -Force
$myCred = New-Object System.Management.Automation.PSCredential($username,$secPwd)
 
$subscription = "[subscription name]"
$SubscriptionId = "[subscription Id]"
$resourceGroup = "[resource group name]"
 
Connect-AzAccount -SubscriptionName $subscription -Credential $myCred

$dataLakeStorageName = "demodls"
New-AzStorageAccount -ResourceGroupName $resourceGroup -Name $dataLakeStorageName -Location "North Europe" -SkuName "Standard_LRS" -EnableHierarchicalNamespace $true

$ctx = (Get-AzStorageAccount -ResourceGroupName $resourceGroup -Name $dataLakeStorageName) | Select "Context"

$containerName = "demo-container"
New-AzStorageContainer -Context $ctx.Context -Name $containerName

$dirname = "demo-kitty/"
New-AzDataLakeGen2Item -Context $ctx.Context -FileSystem $containerName -Path $dirname -Directory

$localSrcFile =  "[path to source file to be loaded into data lake storage]"

$destPath = $dirname + (Get-Item $localSrcFile).Name
New-AzDataLakeGen2Item -Context $ctx.Context -FileSystem $containerName -Path $destPath -Source $localSrcFile -Force

$scopeName = "/subscriptions/" +  $subscriptionId + "/resourceGroups/" + $resourceGroup + "/providers/Microsoft.Storage/storageAccounts/" + $dataLakeStorageName
$scopeName.ToString()
New-AzRoleAssignment -ObjectId "[Synapse workspace managed identity object Id]" -RoleDefinitionName "Storage Blob Data Contributor" -Scope $scopeName

For more details, please refer to the Microsoft documentation on loading data into Synapse Analytics.
