Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless, Part 2: Transformations

In my last post I showed how importing all the data from a folder of csv files stored in ADLSgen2 without doing any transformations performed about the same whether you use Power Query’s native ADLSgen2 connector or use Azure Synapse Serverless. After publishing that post, several people made the same point: there is likely to be a big difference if you do some transformations while importing.

So, using the same data I used in my last post, I did some more testing.

First of all I added an extra step to the original queries to add a filter on the TransDate column so only the rows for 1/1/2015 were returned. Once the datasets were published to the Power BI Service I refreshed them and timed how long the refresh took. The dataset using the ADLSgen2 connector took on average 27 seconds to refresh; the dataset connected to Azure Synapse Serverless took on average 15 seconds.

Next I removed the step with the filter and replaced it with a group by operation, grouping by TransDate and adding a column that counts the number of rows per date. The dataset using the ADLSgen2 connector took on average 28 seconds to refresh; the dataset using Azure Synapse Serverless took on average 15 seconds.

I chose both of these transformations because I guessed they would both fold nicely back to Synapse Serverless, and the test results suggest that I was right. What about transformations where query folding won’t happen with Synapse Serverless?

The final test I did was to remove the step with the group by and then add the following transformations: Capitalize Each Word (which is almost always guaranteed to stop query folding in Power Query) on the GuestId column then split the resulting column in to two separate columns at character position 5. The dataset using the ADLSgen2 connector took on average 99 seconds to refresh; the dataset using Synapse Serverless took on average 137 seconds. I have no idea why this was so much slower than the ADLSgen2 connector but it’s a very interesting result.

A lot more testing is needed here on different transformations and different data volumes but nevertheless I think it’s fair to say the following: if you are doing transformations while importing data into Power BI and you know query folding can take place then using Synapse Serverless as a source may perform a lot better than the native ADLSgen2 connector; however if no query folding is taking place then Synapse Serverless may perform a lot worse than the ADLSgen2 connector. Given that some steps in a Power Query query may fold while others may not, and given that it’s often the most expensive transformations (like filters and group bys) that will fold to Synapse Serverless, then more often than not Synapse Serverless will give you better performance while importing.

Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless

It’s becoming increasingly common to want to import data from files stored in a data lake into Power BI. What’s the best way of doing this though? There are a bewildering number of options with different performance and cost characteristics and I don’t think anyone knows which one to choose. As a result I have decided to do some testing of my own and publish the results here in a series of posts.

Today’s question: is it better to connect direct to files stored in ADLSgen2 using Power BI’s native ADLSgen2 connector or use Azure Synapse Analytics Serverless to connect instead? Specifically, I’m only interested in testing the scenario where you’re reading all the data from these files and not doing any transformations (which is a subject for another post).

To test this I uploaded nine csv files containing almost 8 million rows of random data to an ADLSgen2 container:

First of all I tested loading the data from just the first of these files, NewBasketDataGenerator.csv, into a single Power BI table. In both cases – using the ADLSgen2 connector and using Synapse Serverless via a view – the refresh took on average 14 seconds.

Conclusion #1: importing data from a single csv file into a single Power BI table performs about the same whether you use the ADLSgen2 connector or go via Synapse Serverless.

Next, I tested loading all of the sample data from all of the files into a single table in Power BI. Using the native Power BI ADLSgen2 connector the Power Query Editor created the set of queries you’d expect when combining files from multiple sources:

Here are the columns in the output table:

Using a Power BI PPU workspace in the same Azure region as the ADLSgen2 container it took an average of 65 seconds to load in the Power BI Service.

I then created a view in an Azure Synapse Serverless workspace on the same files (see here for details) and connected to it from a new Power BI dataset via the Synapse connector. Refreshing this dataset in the same PPU workspace in Power BI took an average of 72 seconds.

Conclusion #2: importing data from multiple files in ADLSgen2 into a single table in Power BI is slightly faster using Power BI’s native ADLSgen2 connector than using Azure Synapse Serverless

…which, to be honest, seems obvious – why would putting an extra layer in the architecture make things faster?

Next, I tested loading the same nine files into nine separate tables into a Power BI dataset and again compared the performance of the two connectors. This time the dataset using the native ADLSgen2 connector took on average 45 seconds and the Azure Synapse Serverless approach took 40 seconds on average.

Conclusion #3: importing data from multiple files in ADLSgen2 into multiple tables in Power BI may be slightly faster using Azure Synapse Serverless than using the native ADLSgen2 connector

Why is this? I’m not completely sure, but it could be something to do with Synapse itself or (more likely) Power BI’s Synapse connector. In any case, I’m not sure the difference in performance is significant enough to justify the use of Synapse in this case, at least on performance grounds, even if it is ridiculously cheap.

Not a particularly interesting conclusion in this case I admit. But what about file format: is Parquet faster than CSV for example? What about all those options in the Power BI ADLSgen2 connector? What if I do some transformations? Stay tuned…

Read part 2 of this series here

New Option To Solve Problems With Power Query Not Loading All Data From An Excel Worksheet

Some interesting new Power Query functionality was released in the December release of Power BI Desktop (it’s also in the build of Excel I’m running) that didn’t get announced in the blog post but which will be useful for anyone using Excel as a data source. It is fully documented but I thought I’d blog about it anyway to raise awareness.

It’s a new option on the Excel.Workbook function called InferSheetDimensions. Here’s what the docs for the Excel.Workbook function say:

Can be null or a logical (true/false) value indicating whether the area of a worksheet that contains data should be inferred by reading the worksheet itself, rather than by reading the dimensions metadata from the file. This can be useful in cases where the dimensions metadata is incorrect. Note that this option is only supported for Open XML Excel files, not for legacy Excel files. Default: false.

There’s also a much more detailed explanation on this page:

…in the section “Missing or incomplete Excel data” on how to troubleshoot issues where Power Query isn’t loading all the data from your worksheet.

Here’s a quick summary of the problem. Open XML format Excel workbooks (ie xlsx and xlsm format files) have some metadata (the “dimensions metadata” referred to in the docs) in them that acts as a kind of bounding box describing the range of cells in each worksheet that contain values. By default, Power Query uses this metadata when it gets data from an Excel worksheet – although not when it gets data from an Excel table or named range. Most of the time this metadata is reliable but in some cases, for example where an xlsx file has been generated by third-party software that has “export to Excel” functionality, it contains inaccurate information. This doesn’t affect Excel, which still loads the workbook, but it means that Power Query may not load all the data present in a worksheet. In these cases, setting InferSheetDimensions=true in Excel.Workbook causes Power Query to ignore the dimensions metadata and load all the data from the worksheet.

The docs also point out that if the dimension metadata returns a range that much larger than the actual populated range of cells in the worksheet, this can cause performance problems when loading data. While this can be fixed by manually editing the workbook as shown here, that’s not much help if you’re dealing with Excel files generated by buggy third-party software. Since Excel is already one of the slowest data sources you can use with Power Query, I guess this is another reason to use text formats like CSV with Power Query instead.

Handling Multiple URL Query Parameters With The Same Name Using Web.Contents In Power Query/Power BI

Some time ago I wrote a pair of popular posts about using the Query and RelativePath options of the Web.Contents function in Power Query and why they are important for dataset refresh. I have recently learned something extra about this subject which merits a new post, though: how to handle multiple URL query parameters with the same name.

In the examples for this post I’ll be using a free, fake web service available at which requires no authentication so you will be able to run the code I show yourself. Let’s start by considering the following call to the comments resource of this API:

In M, you can use the Web.Contents function and the Query option to call the API like so:

//Generates the URL

Now this API – unlike many others – allows you to pass multiple URL query parameters of the same name. For example this is a valid call:

The problem is, if you try to do the following in M:


…you’ll get an error, because the Query option take a record and a record can’t have two fields with the same name.

Here’s where it gets interesting. In the working example above, the postId field in the Query record contains a text value. The field can contain a list of text values instead, though, and this is how you solve the problem:

//Generates the URL

This results in a URL with two query parameters that are both called postId but have the values 1 and 2 respectively.

There’s something else worth mentioning about this too. Sometimes when you’re generating a URL you may not want to add a query parameter to it if a value is null. One way of handling this is to start with an empty record and then add fields to it using the Record.AddField function, but using an empty list provides another approach. For example:

//Generates the URL
//and *not*

…generates a URL without the postId query parameter in. This means you can write a simple function with an optional parameter for postId like this:

(optional myPostId as text) =>
    Web.Contents("",[Query=[postId=myPostId ?? {}]])

If a text value is passed to myPostId then the postId query parameter is added to the URL; if no value is passed, myPostId is null and the ?? null coalescing operator (see Ben Gribaudo’s post here for more information on this operator) can be used to replace it with an empty list.

One last thing to mention: if you need to see the web service calls generated by Web.Contents when you’re testing in the Power Query Editor, you can use the query diagnostics functionality as I show here.

[I’m extremely grateful to Dave Maiden for letting me know about all this – it answers a question I have been asked several times in the past]

Sample HTTP Endpoint For Querying Azure Analysis Services And Power BI Premium

I’ve always thought it would be useful if we could run queries against AAS or Power BI via a simple REST API and I’ve just come across a really nice sample project created by one of my colleagues at Microsoft, David Browne, that allows you to do just that. You can find it here:

From the ReadMe:

This sample is a HTTP proxy for XMLA endpoints, intended for use with Power BI Premium or Azure Analysis Services.

It’s implemented as ASP.NET Core 5 API Project. The main API is /api/Query which allows you to POST a DAX query and receive the results as a JSON result.

The sample is coded to pass-through authentication from the client to the XMLA endpoint. So to call the API either use HTTP BASIC auth over HTTPS, passing credentials with the request. It’s highly advised that this be a Service Principal, rather than an AAD user. To specify a Service Principal use a UserName of the form app:[ClientID]@[TenantID], and pass a Client Secret as the Password.

For better security, instead of passing a UserName/Password using HTTP BASIC auth, fetch a Bearer token for your XMLA endpoint. To fetch a token use the Resource ID for Power BI, or https://* for Azure Analysis Services

I can think of all kinds of cool stuff you could do with this…

%d bloggers like this: