It’s quite common to have a requirement to generate a SHA1 or MD5 Hash in SSIS to help with comparing non-key columns for “delta” management.
One question that comes up is “What’s the fastest way to generate a Hash”. I just did a little test to compare two methods:
Method a) SQL Servers HashBytes(‘SHA1’, Column1+ coalesce(Column2,’’)) function
Method b) In SSIS Using the MultipleHash Component http://ssismhash.codeplex.com/
The target data set was a 10 million row table based on the Adventure Works Sales.SaleOrderHeader table scaled up on my 64 bit laptop with an 8 core i7 CPU.
|Method||Time (Secs)||Avg %Processor Time||DTEXEC Working Set (MB)|
|Method A – SQL HashBytes function on source||104||50||47|
|Method B – SSIS MultipleHash Component||644||107||156|
Not sure if anyone else has got better performance from generating the Hash within SSIS as opposed to SQL Server but my conclusion so far is:
- SQL Server HashBytes function is a lot faster than doing the Hash calculation in the Dataflow (x6 in my test).
- There are other factors to choose which method is appropriate:
- The SSIS Custom Component is easier to use (point and click)
- The SSIS Custom Component allows you to hash columns which are “introduced” in the pipeline
- The SSIS component required the shipping of a DLL (bad)
- The HashBytes function does not support nulls so needs a lot of casting to strings and checking for nulls.
- A checksum function is faster again, but carries more risk of not detecting a change.