A real move on Data Lake using the REST API in ADF

There are many times when I’ve needed to move a file using Azure Data Factory and found it strange that the only native option is to copy the file, then delete it. Many readers will know that disk-level renames are almost instantaneous (just a rewrite of the file’s metadata / partition tree).

  • Costs can mount up if you are using a shared IR. Azure Data Factory is not cheap for data movement, with some customers paying thousands of euro a month for simple data movement and copying
  • Moving a 10GB file by copying and then deleting it in ADF takes 92 seconds on average, whereas a move is a simple logical operation taking no more than 9.5 seconds (on average)
  • It just doesn’t make sense to copy and delete in 2022
  • A graph showing these timing improvements:

A sample of the ADF JSON code can be found in this Azure DevOps repo:

https://dev.azure.com/prodata-irl/Demos/_git/adf-examples?path=/adf/pipeline/Move-Blob.json

The Path API for Data Lake supports any operation that can be done using Storage Explorer on its endpoint. As Data Lake has long supported renaming files, we can do this with the API. Renaming a file along with its directory is considered a move operation, and theoretically it should take the same length of time for a file of any size, as you are only modifying the partition table / metadata of the file and not moving, copying or modifying the contents of the file itself on disk.

Most of the information here was taken from this Microsoft Learn article:

https://learn.microsoft.com/en-us/rest/api/storageservices/datalakestoragegen2/path/create

To move a file using REST, you call the Path Create API endpoint, which is accessed by sending a PUT request to the storage account with the destination filesystem and path.

PUT https://{accountName}.{dnsSuffix}/{filesystem}/{path}

With correct authentication and the `?resource=file` query parameter, this request creates an empty file at that destination.
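As a minimal sketch, the create request can be assembled like this in Python (the account, filesystem and path values are illustrative; in practice the request would also carry an Azure AD bearer token, which is omitted here):

```python
from urllib.parse import quote

def build_create_request(account: str, filesystem: str, path: str):
    """Build the PUT request that creates an empty file via the Path Create API.

    Sketch only: authentication headers are omitted. `quote` percent-encodes
    special characters in the path while leaving '/' as the path divider.
    """
    url = f"https://{account}.dfs.core.windows.net/{filesystem}/{quote(path)}?resource=file"
    headers = {"Content-Length": "0"}  # empty body for file creation
    return url, headers

# Hypothetical sample values matching the walkthrough below:
url, headers = build_create_request("moveblobtestingsa", "tutorial", "data/Transactions.csv")
```

Sending this URL with a PUT (for example via `requests.put(url, headers=headers)`) would create the file, assuming the identity has write access.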

To perform a move, the ‘x-ms-rename-source’ request header needs to be sent along with the source file path, as seen below. Per the API reference, the value must begin with a forward slash and include the source filesystem (container).

x-ms-rename-source="/container/directory/sourcefile.dat"
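The move request differs from the create request only in the extra header. A hedged Python sketch (sample names are hypothetical; authentication is again omitted):

```python
from urllib.parse import quote

def build_move_request(account, dest_filesystem, dest_path,
                       source_filesystem, source_path):
    """Build the PUT request that renames (moves) a file via the Path Create API.

    The x-ms-rename-source value starts with '/' and includes the source
    container, per the Path Create API reference.
    """
    url = f"https://{account}.dfs.core.windows.net/{dest_filesystem}/{quote(dest_path)}"
    headers = {"x-ms-rename-source": f"/{source_filesystem}/{quote(source_path)}"}
    return url, headers

# Hypothetical sample matching the walkthrough below:
url, headers = build_move_request(
    "moveblobtestingsa", "tutorial", "data/Transactions.csv",
    "destination", "data_destination/Transactions.csv")
```

Because only metadata changes server-side, this single PUT replaces the copy-then-delete pair of activities.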

To do this in ADF we use the Web activity with the PUT method.

The URL follows the template below:

https://{accountName}.{dnsSuffix}/{filesystem}/{path}

The filesystem (container) and path in the URL are the destination; the x-ms-rename-source header holds the source file.

Authentication uses the Data Factory’s system-assigned managed identity, with the resource set as:

https://{accountName}.{dnsSuffix}/

Below is a sample:

URL:

https://moveblobtestingsa.dfs.core.windows.net/tutorial/data/Transactions.csv

Resource:

https://moveblobtestingsa.dfs.core.windows.net/

Headers:

x-ms-rename-source: /destination/data_destination/Transactions.csv

Making this activity dynamic introduces some issues: because the destination file is encoded in the URL, using parameterized values adds complexity. Characters like spaces and URL-special characters (&%=) must be percent-encoded, but the path divider (/) between directories must not be.

https://@{pipeline().parameters.StorageAccountName}.dfs.core.windows.net/@{pipeline().parameters.SinkContainer}@{if(equals(first(pipeline().parameters.SinkDirectory), '/'), '', '/')}@{replace(uriComponent(pipeline().parameters.SinkDirectory),'%2F','/')}@{if(equals(last(pipeline().parameters.SinkDirectory), '/'), '', '/')}@{uriComponent(pipeline().parameters.SinkFile)}
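The slash-normalization and encoding logic of that expression can be approximated in Python. This is a sketch, not the ADF runtime: `quote(..., safe='/')` plays the role of `uriComponent` followed by `replace('%2F','/')`, and `quote(..., safe='')` encodes everything, as `uriComponent` does for the file name:

```python
from urllib.parse import quote

def build_sink_url(storage_account, sink_container, sink_directory, sink_file):
    """Mirror the ADF URL expression: percent-encode the directory while keeping
    '/' as the divider, and add '/' only where the directory parameter doesn't
    already supply a leading or trailing one.
    """
    lead = "" if sink_directory.startswith("/") else "/"
    trail = "" if sink_directory.endswith("/") else "/"
    encoded_dir = quote(sink_directory, safe="/")   # keep path dividers
    encoded_file = quote(sink_file, safe="")        # encode everything, incl. '/'
    return (f"https://{storage_account}.dfs.core.windows.net/"
            f"{sink_container}{lead}{encoded_dir}{trail}{encoded_file}")
```

For example, a directory parameter of `/my dir/` produces `.../c/my%20dir/...`, with the surrounding slashes passed through but the space encoded.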

And here is the expression for the x-ms-rename-source header value:

@{  
    concat(
        '/',
        pipeline().parameters.SourceContainer,
        if(
            equals(
                first(pipeline().parameters.SourceDirectory),
                '/'
            ),
            '',
            '/'
        ), 
        pipeline().parameters.SourceDirectory,
        if(
            equals(
                last(pipeline().parameters.SourceDirectory),
                '/'
            ),
            '',
            '/'
        ),
        pipeline().parameters.SourceFile)
}
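The same concatenation can be sketched in Python for clarity (note that, unlike the URL expression, this concat does not percent-encode the source path):

```python
def build_rename_source(source_container, source_directory, source_file):
    """Mirror the ADF concat expression for x-ms-rename-source:
    '/' + container + directory + file, inserting '/' separators only where
    the directory parameter doesn't already provide them.
    """
    lead = "" if source_directory.startswith("/") else "/"
    trail = "" if source_directory.endswith("/") else "/"
    return f"/{source_container}{lead}{source_directory}{trail}{source_file}"
```

With the sample values from earlier, `build_rename_source("destination", "data_destination", "Transactions.csv")` yields the header value shown above.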
