It’s becoming increasingly common to want to import data from files stored in a data lake into Power BI. What’s the best way of doing this though? There are a bewildering number of options with different performance and cost characteristics and I don’t think anyone knows which one to choose. As a result I have decided to do some testing of my own and publish the results here in a series of posts.
Today’s question: is it better to connect direct to files stored in ADLSgen2 using Power BI’s native ADLSgen2 connector or use Azure Synapse Analytics Serverless to connect instead? Specifically, I’m only interested in testing the scenario where you’re reading all the data from these files and not doing any transformations (which is a subject for another post).
To test this I uploaded nine csv files containing almost 8 million rows of random data to an ADLSgen2 container:
First of all I tested loading the data from just the first of these files, NewBasketDataGenerator.csv, into a single Power BI table. In both cases – using the ADLSgen2 connector and using Synapse Serverless via a view – the refresh took on average 14 seconds.
Conclusion #1: importing data from a single csv file into a single Power BI table performs about the same whether you use the ADLSgen2 connector or go via Synapse Serverless.
Next, I tested loading all of the sample data from all of the files into a single table in Power BI. Using the native Power BI ADLSgen2 connector the Power Query Editor created the set of queries you’d expect when combining files from multiple sources:
Here are the columns in the output table:
Using a Power BI PPU workspace in the same Azure region as the ADLSgen2 container it took an average of 65 seconds to load in the Power BI Service.
I then created a view in an Azure Synapse Serverless workspace on the same files (see here for details) and connected to it from a new Power BI dataset via the Synapse connector. Refreshing this dataset in the same PPU workspace in Power BI took an average of 72 seconds.
Conclusion #2: importing data from multiple files in ADLSgen2 into a single table in Power BI is slightly faster using Power BI’s native ADLSgen2 connector than using Azure Synapse Serverless
…which, to be honest, seems obvious – why would putting an extra layer in the architecture make things faster?
Next, I tested loading the same nine files into nine separate tables into a Power BI dataset and again compared the performance of the two connectors. This time the dataset using the native ADLSgen2 connector took on average 45 seconds and the Azure Synapse Serverless approach took 40 seconds on average.
Conclusion #3: importing data from multiple files in ADLSgen2 into multiple tables in Power BI may be slightly faster using Azure Synapse Serverless than using the native ADLSgen2 connector
Why is this? I’m not completely sure, but it could be something to do with Synapse itself or (more likely) Power BI’s Synapse connector. In any case, I’m not sure the difference in performance is significant enough to justify the use of Synapse in this case, at least on performance grounds, even if it is ridiculously cheap.
Not a particularly interesting conclusion in this case I admit. But what about file format: is Parquet faster than CSV for example? What about all those options in the Power BI ADLSgen2 connector? What if I do some transformations? Stay tuned…