When you extract a file into a Fabric (Lakehouse or Warehouse) table from a CSV, you can leave the columns as the default varchar(8000), or, with additional work, apply proper data types.
Some questions we seek to answer:
- Do correct data types improve performance, given that storage is Delta Parquet rather than a SQL format?
- Is it worth the effort?
- How would we do it?
- Are there any best practices?
In this post we explore the above using a Python notebook to load a 7 GB CSV file containing sales data with 44 columns, then a SQL endpoint to query it. The CSV uses a format common to retail POS systems, although the data content was generated.
We tested four different scenarios, each using PySpark but with the variations below.
Test Scenarios
1. Default PySpark with no data types
In this style of code we make no effort at all to establish data types when loading the CSV:
file_path = "Files/Test/TestCSV6.csv"
df = spark.read.format("csv").option("header", True).load(file_path)
df.write.mode("overwrite").format("delta").saveAsTable("testcsv4s2a")
The result is a table with 44 columns, all varchar(8000).

Can I say “yuck” 😉
2. PySpark dataframe using a STRUCT
In this style of code we use a STRUCT in the notebook to give the correct data type for each column. With 44 columns this is a fair bit of work.
file_path = "Files/Test/TestCSV6.csv"
from pyspark.sql.types import (StructType, StructField, LongType, ByteType,
    IntegerType, DateType, TimestampType, StringType, BooleanType, DecimalType)

table_schema = StructType([
    StructField("GUESTCHECKLINEITEMID", LongType(), True),
    StructField("ORGANIZATIONID", ByteType(), True),
    StructField("LOCATIONID", IntegerType(), True),
    StructField("REVENUECENTERID", LongType(), True),
    StructField("ORDERTYPEID", LongType(), True),
    StructField("BUSINESSDATE", DateType(), True),
    StructField("FIXEDPERIOD", ByteType(), True),
    StructField("TRANSDATETIME", TimestampType(), True),
    StructField("POSTRANSREF", LongType(), True),
    StructField("SERVICEROUNDNUM", ByteType(), True),
    StructField("LINENUM", ByteType(), True),
    StructField("SEATNUM", ByteType(), True),
    StructField("DETAILTYPE", ByteType(), True),
    StructField("RECORDID", ByteType(), True),
    StructField("PRICELEVEL", ByteType(), True),
    StructField("UWSID", IntegerType(), True),
    StructField("CHECKEMPLOYEEID", LongType(), True),
    StructField("TRANSEMPLOYEEID", LongType(), True),
    StructField("MANAGEREMPLOYEEID", LongType(), True),
    StructField("STATUS", StringType(), True),
    StructField("BINARYSTATUS", BooleanType(), True),
    StructField("VOIDFLAG", BooleanType(), True),
    StructField("GENFLAG1", BooleanType(), True),
    StructField("REASONCODE", ByteType(), True),
    StructField("LINECOUNT", ByteType(), True),
    StructField("LINETOTAL", DecimalType(), True),
    StructField("REPORTLINECOUNT", ByteType(), True),
    StructField("REPORTLINETOTAL", DecimalType(), True),
    StructField("REFERENCEINFO", StringType(), True),
    StructField("MOVEFLAG", BooleanType(), True),
    StructField("DONOTSHOW", BooleanType(), True),
    StructField("DAYPARTID", IntegerType(), True),
    StructField("PRICEOVRDEFLAG", BooleanType(), True),
    StructField("TAXEXEMPTFLAG", BooleanType(), True),
    StructField("ERRORCORRECTFLAG", BooleanType(), True),
    StructField("REASONCODEID", LongType(), True),
    StructField("TAX1TOTAL", DecimalType(), True),
    StructField("MAJORGROUPID", LongType(), True),
    StructField("FAMILYGROUPID", LongType(), True),
    StructField("DTLID", ByteType(), True),
    StructField("ACTIVETAXES", StringType(), True),
    StructField("ADJUSTDATETIME", TimestampType(), True),
    StructField("TAX1POSREF", StringType(), True)
])
df = spark.read.format("csv").option("header", True).schema(table_schema).load(file_path)
df.write.mode("overwrite").format("delta").saveAsTable("testcsv4ss")
The end result is a very strongly typed staging table.
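As an aside, the same schema can also be expressed as a compact DDL-formatted string, which Spark's schema() accepts in place of a StructType. A minimal sketch of the idea (column list abbreviated, and the DECIMAL(10,2) precision is our illustrative assumption, not from the table above):

```python
# Sketch: build a DDL schema string from (name, type) pairs instead of a
# 40+ line StructType. Spark's DataFrameReader.schema() accepts this string.
# Column list abbreviated here; DECIMAL precision is an assumed example.
columns = [
    ("GUESTCHECKLINEITEMID", "BIGINT"),
    ("ORGANIZATIONID", "TINYINT"),
    ("LOCATIONID", "INT"),
    ("BUSINESSDATE", "DATE"),
    ("TRANSDATETIME", "TIMESTAMP"),
    ("LINETOTAL", "DECIMAL(10,2)"),
    ("VOIDFLAG", "BOOLEAN"),
    ("REFERENCEINFO", "STRING"),
]

ddl_schema = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
print(ddl_schema)

# Then, in the notebook:
# df = spark.read.format("csv").option("header", True).schema(ddl_schema).load(file_path)
```

This keeps the column list in plain data, which also makes it easy to drive from metadata rather than hand-written code.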

3. PySpark using inferSchema
In this style of code we simply use Spark's inferSchema option. This triggers a second full pass over the data to recommend (effectively guess) data types:
file_path = "Files/Test/TestCSV6.csv"
df = spark.read.format("csv").option("header", True).option("inferSchema",True).load(file_path)
df.write.mode("overwrite").format("delta").saveAsTable("testcsv5s")
The end result here is a table with some data types, albeit from a limited set, and these could potentially change if the data changes. The additional effort, though, is almost none: just the extra inferSchema option.
We also notice that inference uses the float data type, which we traditionally avoid in favour of money and decimal for precision reasons.
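The precision concern is easy to demonstrate in plain Python: binary floating point cannot represent many base-10 amounts exactly, which is why decimal types are preferred for money. A quick illustration:

```python
from decimal import Decimal

# Classic binary floating-point artifact: 0.1 + 0.2 is not exactly 0.3.
float_total = 0.1 + 0.2
print(float_total)                       # 0.30000000000000004

# Decimal does exact base-10 arithmetic, like SQL money/decimal types.
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total)                     # 0.3

print(float_total == 0.3)                # False
print(decimal_total == Decimal("0.3"))   # True
```

Small per-row errors like this can accumulate across millions of sales lines, which is why typed decimal columns matter for financial totals.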

We can see in the Spark log that inferSchema does indeed result in a full second pass over the data, and will increase overall time by roughly 33% if we consider the following:
- 1 unit of work to read into the initial data frame
- 1 unit of work to infer the schema
- 2 units of work to write (writing is about 2x slower than reading)
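The rough arithmetic behind that 33% figure, under the assumed cost model above:

```python
# Assumed cost model from the bullets above: a read costs 1 unit,
# a write about 2 units, and inferSchema adds a second full read.
read_cost = 1    # initial read into the data frame
infer_cost = 1   # inferSchema's extra pass over the CSV
write_cost = 2   # writing is about 2x slower than reading

baseline = read_cost + write_cost                 # load without inferSchema: 3
with_infer = read_cost + infer_cost + write_cost  # load with inferSchema: 4

overhead = (with_infer - baseline) / baseline
print(f"{overhead:.0%}")   # 33%
```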

4. PySpark using insertInto
In this style of code we assume the table is created separately, then use insertInto to insert data into columns with pre-defined data types. Columns are matched by ordinal position, as far as we know.
file_path = "Files/Test/TestCSV6.csv"
df = spark.read.format("csv").option("header", True).load(file_path)
df.write.insertInto("testcsv5s")
We have never actually used this approach on any POCs or customer projects yet, as it would also require you to manage truncation or deletion, and of course the table-creation process.
Test Results
We ran the same CSV through all four scenarios to test query performance; results are below.

Conclusions and Best Practices
For us, we generally need some form of data typing; otherwise we open a whole can of worms when dates and numbers end up in strings.
Using inferSchema is significantly faster and easier than applying a STRUCT within the notebook. However, the inferred types aren't guaranteed to be correct, so it could cause functional issues.
In most cases we would recommend (and use) inferSchema and go for the faster, easier option.
In the future we may explore the use of insertInto. It has the benefit of creating the schema only once, rather than re-scanning the data types on every single file read.
One hybrid option is inferSchema if the table doesn't exist and insertInto if it does, for the best of both worlds.
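A minimal sketch of that hybrid decision, kept as a plain function so the branch is easy to test; the PySpark wiring shown in the comments reuses the scenario code above (the table name is just the one from our tests):

```python
def choose_write_strategy(table_exists: bool) -> str:
    """Hybrid load path: create-with-inferSchema on first load,
    insertInto into the already-typed table thereafter."""
    return "insertInto" if table_exists else "createWithInferSchema"

# In the notebook this would drive something like:
#   if spark.catalog.tableExists("testcsv5s"):
#       df = spark.read.format("csv").option("header", True).load(file_path)
#       df.write.insertInto("testcsv5s")
#   else:
#       df = (spark.read.format("csv")
#                  .option("header", True).option("inferSchema", True)
#                  .load(file_path))
#       df.write.mode("overwrite").format("delta").saveAsTable("testcsv5s")

print(choose_write_strategy(False))  # createWithInferSchema
print(choose_write_strategy(True))   # insertInto
```

This way the inferSchema double pass is paid only once per table, and every subsequent file load skips it.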
Drop us a note on Twitter if you have any other insights into best practices here.
https://twitter.com/bob_duffy
The Perfect Extract Solution for Automation
As you may know, we put a lot of work into Fabric automation, trying to get the right balance of agility and extensibility; more details below.
So far, I think the design we will choose for that is
- Use inferSchema by default when the table is not present
- Use insertInto by default when the table is present
- Have a metadata option to specify DDL for the columns if you want "perfect" staging tables and are willing to spend the effort, or if inferSchema gives you functional issues; if this is present, default to this method
- Have a metadata flag to toggle whether you drop, overwrite, append or truncate, with a default of overwrite
- Allow override of all defaults for tuning
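The bullets above could be wired up as a simple defaults-plus-overrides merge per file. A sketch under those rules; every key name here is a hypothetical illustration, not an actual framework API:

```python
# Hypothetical framework defaults, following the bullets above.
DEFAULTS = {
    "schema_source": "inferSchema",  # used when no DDL metadata is supplied
    "write_mode": "overwrite",       # drop / overwrite / append / truncate
    "ddl": None,                     # optional explicit column DDL
}

def resolve_options(file_metadata: dict) -> dict:
    """Merge per-file metadata over framework defaults; explicit DDL wins."""
    options = {**DEFAULTS, **file_metadata}
    if options["ddl"]:
        options["schema_source"] = "ddl"  # metadata DDL overrides inference
    return options

# File with no metadata: framework defaults apply.
print(resolve_options({}))
# File with explicit DDL and append mode: the overrides win.
print(resolve_options({"ddl": "LINETOTAL DECIMAL(10,2)", "write_mode": "append"}))
```

The point of the merge order is that the template "just works" with an empty metadata entry, while any file can opt into the more specialist behaviour.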
This will give a framework that "just works" for 99% of files, but also allows you to override the basic behaviour when something more specialist is needed. Sure, the template becomes a bit more complex, but as it's written once and used thousands of times within an automation framework, that doesn't matter as much.
