Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 2: Transformations

In my last post I showed how importing all the data from a folder of csv files stored in ADLSgen2 without doing any transformations performed about the same whether you use Power Query’s native ADLSgen2 connector or use Azure Synapse Serverless. After publishing that post, several people made the same point: there is likely to be a big difference if you do some transformations while importing.

So, using the same data I used in my last post, I did some more testing.

First of all I added an extra step to the original queries to add a filter on the TransDate column so only the rows for 1/1/2015 were returned. Once the datasets were published to the Power BI Service I refreshed them and timed how long the refresh took. The dataset using the ADLSgen2 connector took on average 27 seconds to refresh; the dataset connected to Azure Synapse Serverless took on average 15 seconds.

Next I removed the step with the filter and replaced it with a group by operation, grouping by TransDate and adding a column that counts the number of rows per date. The dataset using the ADLSgen2 connector took on average 28 seconds to refresh; the dataset using Azure Synapse Serverless took on average 15 seconds.
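As a rough illustration of what these two transformations do to the data, here is a minimal Python sketch. It is not how Power Query or Synapse Serverless executes them; the sample rows and function names are made up, and only the TransDate column name comes from the tests. Both operations return far fewer rows than they read, which is presumably why pushing them back to the source helps.

```python
from collections import Counter
from datetime import date

# Hypothetical sample rows; only the TransDate and GuestId column names come from the post.
rows = [
    {"TransDate": date(2015, 1, 1), "GuestId": "guest one"},
    {"TransDate": date(2015, 1, 1), "GuestId": "guest two"},
    {"TransDate": date(2015, 1, 2), "GuestId": "guest three"},
]

def filter_by_date(rows, target):
    """Keep only the rows whose TransDate equals target (the first test)."""
    return [r for r in rows if r["TransDate"] == target]

def count_rows_per_date(rows):
    """Group by TransDate and count the rows for each date (the second test)."""
    return Counter(r["TransDate"] for r in rows)

filtered = filter_by_date(rows, date(2015, 1, 1))
counts = count_rows_per_date(rows)
```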

I chose both of these transformations because I guessed they would both fold nicely back to Synapse Serverless, and the test results suggest that I was right. What about transformations where query folding won’t happen with Synapse Serverless?

The final test I did was to remove the step with the group by and then add the following transformations: Capitalize Each Word (which is almost always guaranteed to stop query folding in Power Query) on the GuestId column, then split the resulting column into two separate columns at character position 5. The dataset using the ADLSgen2 connector took on average 99 seconds to refresh; the dataset using Synapse Serverless took on average 137 seconds. I have no idea why Synapse Serverless was so much slower than the ADLSgen2 connector here but it’s a very interesting result.
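For readers who have not used these two transformations, this Python sketch approximates their effect on a single value. The GuestId value is hypothetical, and Python's `str.title()` is only an approximation of Capitalize Each Word; edge cases such as apostrophes may behave differently in Power Query.

```python
def capitalize_each_word(text):
    # Approximates Power Query's Capitalize Each Word with str.title();
    # this is an assumption, not the Power Query implementation.
    return text.title()

def split_at_position(text, position=5):
    # Split one value into two parts at the given character position.
    return text[:position], text[position:]

guest_id = "guest 00042"  # hypothetical GuestId value
capitalized = capitalize_each_word(guest_id)
left, right = split_at_position(capitalized)
```

Every row has to pass through both steps, so when they cannot fold all the rows must be returned to Power Query before any of this work is done.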

A lot more testing is needed here on different transformations and different data volumes, but nevertheless I think it’s fair to say the following: if you are doing transformations while importing data into Power BI and you know query folding can take place, then using Synapse Serverless as a source may perform a lot better than the native ADLSgen2 connector; however, if no query folding is taking place then Synapse Serverless may perform a lot worse than the ADLSgen2 connector. Given that some steps in a Power Query query may fold while others may not, and given that it’s often the most expensive transformations (like filters and group bys) that will fold to Synapse Serverless, more often than not Synapse Serverless will give you better performance while importing.
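To put the averages from the three tests side by side, here is a small Python sketch that expresses each Synapse Serverless refresh time as a fraction of the ADLSgen2 time. The numbers are the averages reported in this post; the dictionary keys and function name are just labels chosen for illustration.

```python
# Average refresh times in seconds, as reported in this post.
timings = {
    "filter on TransDate": {"adls": 27, "synapse": 15},
    "group by TransDate": {"adls": 28, "synapse": 15},
    "capitalize and split": {"adls": 99, "synapse": 137},
}

def relative_to_adls(adls, synapse):
    """Return the Synapse Serverless time as a fraction of the ADLSgen2 time."""
    return synapse / adls

for test, t in timings.items():
    ratio = relative_to_adls(t["adls"], t["synapse"])
    verdict = "faster" if ratio < 1 else "slower"
    print(f"{test}: Synapse took {ratio:.0%} of the ADLSgen2 time ({verdict})")
```

On these figures the two folded tests take a little over half the ADLSgen2 time, while the test that does not fold takes well over a third longer.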

Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless

It’s becoming increasingly common to want to import data from files stored in a data lake into Power BI. What’s the best way of doing this though? There are a bewildering number of options with different performance and cost characteristics and I don’t think anyone knows which one to choose. As a result I have decided to do some testing of my own and publish the results here in a series of posts.

Today’s question: is it better to connect direct to files stored in ADLSgen2 using Power BI’s native ADLSgen2 connector or use Azure Synapse Analytics Serverless to connect instead? Specifically, I’m only interested in testing the scenario where you’re reading all the data from these files and not doing any transformations (which is a subject for another post).

To test this I uploaded nine csv files containing almost 8 million rows of random data to an ADLSgen2 container:

First of all I tested loading the data from just the first of these files, NewBasketDataGenerator.csv, into a single Power BI table. In both cases – using the ADLSgen2 connector and using Synapse Serverless via a view – the refresh took on average 14 seconds.

Conclusion #1: importing data from a single csv file into a single Power BI table performs about the same whether you use the ADLSgen2 connector or go via Synapse Serverless.

Next, I tested loading all of the sample data from all of the files into a single table in Power BI. Using the native Power BI ADLSgen2 connector the Power Query Editor created the set of queries you’d expect when combining files from multiple sources:

Here are the columns in the output table:

Using a Power BI PPU workspace in the same Azure region as the ADLSgen2 container it took an average of 65 seconds to load in the Power BI Service.

I then created a view in an Azure Synapse Serverless workspace on the same files (see here for details) and connected to it from a new Power BI dataset via the Synapse connector. Refreshing this dataset in the same PPU workspace in Power BI took an average of 72 seconds.

Conclusion #2: importing data from multiple files in ADLSgen2 into a single table in Power BI is slightly faster using Power BI’s native ADLSgen2 connector than using Azure Synapse Serverless

…which, to be honest, seems obvious – why would putting an extra layer in the architecture make things faster?

Next, I tested loading the same nine files into nine separate tables in a Power BI dataset and again compared the performance of the two connectors. This time the dataset using the native ADLSgen2 connector took on average 45 seconds and the Azure Synapse Serverless approach took 40 seconds on average.

Conclusion #3: importing data from multiple files in ADLSgen2 into multiple tables in Power BI may be slightly faster using Azure Synapse Serverless than using the native ADLSgen2 connector.

Why is this? I’m not completely sure, but it could be something to do with Synapse itself or (more likely) Power BI’s Synapse connector. In any case, I’m not sure the difference in performance is significant enough to justify the use of Synapse in this case, at least on performance grounds, even if it is ridiculously cheap.
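To see how small these gaps are, the following Python sketch computes the percentage difference between the two approaches for each of the three scenarios. The timings are the averages reported in this post; the scenario labels and function name are illustrative only.

```python
# Average refresh times in seconds, as reported in this post:
# (native ADLSgen2 connector, Synapse Serverless).
results = {
    "one file, one table": (14, 14),
    "nine files, one table": (65, 72),
    "nine files, nine tables": (45, 40),
}

def percent_difference(adls_seconds, synapse_seconds):
    """Percentage by which Synapse Serverless was slower (+) or faster (-)."""
    return (synapse_seconds - adls_seconds) / adls_seconds * 100

for scenario, (adls, synapse) in results.items():
    print(f"{scenario}: {percent_difference(adls, synapse):+.1f}%")
```

The differences come out at roughly 11% in either direction, which is in line with the view that neither option has a decisive advantage when no transformations are involved.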

Not a particularly interesting conclusion in this case I admit. But what about file format: is Parquet faster than CSV for example? What about all those options in the Power BI ADLSgen2 connector? What if I do some transformations? Stay tuned…

Read part 2 of this series here
