Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 2: Transformations

In my last post I showed how importing all the data from a folder of csv files stored in ADLSgen2 without doing any transformations performed about the same whether you use Power Query’s native ADLSgen2 connector or use Azure Synapse Serverless. After publishing that post, several people made the same point: there is likely to be a big difference if you do some transformations while importing.

So, using the same data I used in my last post, I did some more testing.

First of all I added an extra step to the original queries to add a filter on the TransDate column so only the rows for 1/1/2015 were returned. Once the datasets were published to the Power BI Service I refreshed them and timed how long the refresh took. The dataset using the ADLSgen2 connector took on average 27 seconds to refresh; the dataset connected to Azure Synapse Serverless took on average 15 seconds.

Next I removed the step with the filter and replaced it with a group by operation, grouping by TransDate and adding a column that counts the number of rows per date. The dataset using the ADLSgen2 connector took on average 28 seconds to refresh; the dataset using Azure Synapse Serverless took on average 15 seconds.

I chose both of these transformations because I guessed they would both fold nicely back to Synapse Serverless, and the test results suggest that I was right. What about transformations where query folding won’t happen with Synapse Serverless?

The final test I did was to remove the step with the group by and then add the following transformations: Capitalize Each Word (which is almost always guaranteed to stop query folding in Power Query) on the GuestId column then split the resulting column in to two separate columns at character position 5. The dataset using the ADLSgen2 connector took on average 99 seconds to refresh; the dataset using Synapse Serverless took on average 137 seconds. I have no idea why this was so much slower than the ADLSgen2 connector but it’s a very interesting result.

A lot more testing is needed here on different transformations and different data volumes but nevertheless I think it’s fair to say the following: if you are doing transformations while importing data into Power BI and you know query folding can take place then using Synapse Serverless as a source may perform a lot better than the native ADLSgen2 connector; however if no query folding is taking place then Synapse Serverless may perform a lot worse than the ADLSgen2 connector. Given that some steps in a Power Query query may fold while others may not, and given that it’s often the most expensive transformations (like filters and group bys) that will fold to Synapse Serverless, then more often than not Synapse Serverless will give you better performance while importing.

3 responses

  1. Thanks Chris for testing the performance. But regarding the money, as I understand, for the ADLSgen2 we don’t have to pay for the extra cost (under Pro license), but we have to pay for Azure Synapse Analytics right? Thanks!

  2. Pingback: Chris Webb's BI Blog: Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless Chris Webb's BI Blog

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: