Comparing The Performance Of Importing Data Into Power BI From ADLSgen2 Direct And Via Azure Synapse Analytics Serverless

It’s becoming increasingly common to want to import data from files stored in a data lake into Power BI. What’s the best way of doing this though? There are a bewildering number of options, each with different performance and cost characteristics, and it’s hard to know which one to choose. As a result I have decided to do some testing of my own and publish the results here in a series of posts.

Today’s question: is it better to connect direct to files stored in ADLSgen2 using Power BI’s native ADLSgen2 connector or use Azure Synapse Analytics Serverless to connect instead? Specifically, I’m only interested in testing the scenario where you’re reading all the data from these files and not doing any transformations (which is a subject for another post).

To test this I uploaded nine csv files containing almost 8 million rows of random data to an ADLSgen2 container:

First of all I tested loading the data from just the first of these files, NewBasketDataGenerator.csv, into a single Power BI table. In both cases – using the ADLSgen2 connector and using Synapse Serverless via a view – the refresh took on average 14 seconds.
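For reference, here’s a minimal sketch of the two connection patterns being compared – the storage account, container, Synapse workspace and view names below are placeholders, not the ones I used in my tests:

//A minimal sketch of the two connection patterns being compared
//(in the tests these were two separate queries; both are shown in one query here for brevity)
let
  //1. Native ADLSgen2 connector: list the files in the container
  //   and read NewBasketDataGenerator.csv directly
  AdlsFiles = AzureStorage.DataLake(
      "https://myaccount.dfs.core.windows.net/mycontainer"
    ),
  CsvContents = AdlsFiles{[Name = "NewBasketDataGenerator.csv"]}[Content],
  ViaAdls = Table.PromoteHeaders(
      Csv.Document(CsvContents, [Delimiter = ",", Encoding = 65001]),
      [PromoteAllScalars = true]
    ),
  //2. Synapse Serverless: connect to the workspace's serverless SQL
  //   endpoint and read from a view created over the same file
  SynapseDb = Sql.Database(
      "myworkspace-ondemand.sql.azuresynapse.net",
      "MyDatabase"
    ),
  ViaSynapse = SynapseDb{[Schema = "dbo", Item = "BasketData"]}[Data]
in
  ViaSynapse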

Conclusion #1: importing data from a single csv file into a single Power BI table performs about the same whether you use the ADLSgen2 connector or go via Synapse Serverless.

Next, I tested loading all of the sample data from all of the files into a single table in Power BI. Using the native Power BI ADLSgen2 connector the Power Query Editor created the set of queries you’d expect when combining files from multiple sources:

Here are the columns in the output table:

Using a Power BI PPU workspace in the same Azure region as the ADLSgen2 container it took an average of 65 seconds to load in the Power BI Service.

I then created a view in an Azure Synapse Serverless workspace on the same files (see here for details) and connected to it from a new Power BI dataset via the Synapse connector. Refreshing this dataset in the same PPU workspace in Power BI took an average of 72 seconds.

Conclusion #2: importing data from multiple files in ADLSgen2 into a single table in Power BI is slightly faster using Power BI’s native ADLSgen2 connector than using Azure Synapse Serverless

…which, to be honest, seems obvious – why would putting an extra layer in the architecture make things faster?

Next, I tested loading the same nine files into nine separate tables in a Power BI dataset and again compared the performance of the two connectors. This time the dataset using the native ADLSgen2 connector took on average 45 seconds to refresh and the Azure Synapse Serverless approach took 40 seconds on average.

Conclusion #3: importing data from multiple files in ADLSgen2 into multiple tables in Power BI may be slightly faster using Azure Synapse Serverless than using the native ADLSgen2 connector.

Why is this? I’m not completely sure, but it could be something to do with Synapse itself or (more likely) Power BI’s Synapse connector. In any case, I’m not sure the difference in performance is significant enough to justify the use of Synapse in this case, at least on performance grounds, even if it is ridiculously cheap.

Not a particularly interesting conclusion in this case I admit. But what about file format: is Parquet faster than CSV for example? What about all those options in the Power BI ADLSgen2 connector? What if I do some transformations? Stay tuned…

Read part 2 of this series here

New Option To Solve Problems With Power Query Not Loading All Data From An Excel Worksheet

Some interesting new Power Query functionality was released in the December release of Power BI Desktop (it’s also in the build of Excel I’m running) that didn’t get a mention in the release blog post but which will be useful for anyone using Excel as a data source. It is fully documented, but I thought I’d blog about it anyway to raise awareness.

It’s a new option on the Excel.Workbook function called InferSheetDimensions. Here’s what the docs for the Excel.Workbook function say:

Can be null or a logical (true/false) value indicating whether the area of a worksheet that contains data should be inferred by reading the worksheet itself, rather than by reading the dimensions metadata from the file. This can be useful in cases where the dimensions metadata is incorrect. Note that this option is only supported for Open XML Excel files, not for legacy Excel files. Default: false.

There’s also a much more detailed explanation on this page:

https://docs.microsoft.com/en-us/power-query/connectors/excel#troubleshooting

…in the section “Missing or incomplete Excel data” on how to troubleshoot issues where Power Query isn’t loading all the data from your worksheet.

Here’s a quick summary of the problem. Open XML format Excel workbooks (ie xlsx and xlsm format files) have some metadata (the “dimensions metadata” referred to in the docs) in them that acts as a kind of bounding box describing the range of cells in each worksheet that contain values. By default, Power Query uses this metadata when it gets data from an Excel worksheet – although not when it gets data from an Excel table or named range. Most of the time this metadata is reliable but in some cases, for example where an xlsx file has been generated by third-party software that has “export to Excel” functionality, it contains inaccurate information. This doesn’t affect Excel, which still loads the workbook, but it means that Power Query may not load all the data present in a worksheet. In these cases, setting InferSheetDimensions=true in Excel.Workbook causes Power Query to ignore the dimensions metadata and load all the data from the worksheet.
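Here’s a minimal sketch of what this looks like in M – the file path and sheet name below are just examples:

let
  Source = Excel.Workbook(
      File.Contents("C:\SomeFolder\ProblemWorkbook.xlsx"),
      //Ignore the dimensions metadata and scan the worksheet itself
      [InferSheetDimensions = true]
    ),
  Sheet1 = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
  PromotedHeaders = Table.PromoteHeaders(Sheet1, [PromoteAllScalars = true])
in
  PromotedHeaders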

The docs also point out that if the dimensions metadata describes a range that is much larger than the actual populated range of cells in the worksheet, this can cause performance problems when loading data. While this can be fixed by manually editing the workbook as shown here, that’s not much help if you’re dealing with Excel files generated by buggy third-party software. Since Excel is already one of the slowest data sources you can use with Power Query, I guess this is another reason to use text formats like CSV with Power Query instead.

Handling Multiple URL Query Parameters With The Same Name Using Web.Contents In Power Query/Power BI

Some time ago I wrote a pair of popular posts about using the Query and RelativePath options of the Web.Contents function in Power Query and why they are important for dataset refresh. I have recently learned something extra about this subject which merits a new post, though: how to handle multiple URL query parameters with the same name.

In the examples for this post I’ll be using a free, fake web service available at http://jsonplaceholder.typicode.com/ which requires no authentication so you will be able to run the code I show yourself. Let’s start by considering the following call to the comments resource of this API:

http://jsonplaceholder.typicode.com/comments?postId=1

In M, you can use the Web.Contents function and the Query option to call the API like so:

//Generates the URL http://jsonplaceholder.typicode.com/comments?postId=1
Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId="1"]])

Now this API – unlike many others – allows you to pass multiple URL query parameters of the same name. For example this is a valid call:

http://jsonplaceholder.typicode.com/comments?postId=1&postId=2

The problem is, if you try to do the following in M:

Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId="1",postId="2"]])

…you’ll get an error, because the Query option takes a record and a record can’t have two fields with the same name.

Here’s where it gets interesting. In the working example above, the postId field in the Query record contains a text value. The field can contain a list of text values instead, though, and this is how you solve the problem:

//Generates the URL http://jsonplaceholder.typicode.com/comments?postId=1&postId=2
Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId={"1","2"}]])

This results in a URL with two query parameters that are both called postId but have the values 1 and 2 respectively.

There’s something else worth mentioning about this too. Sometimes when you’re generating a URL you may not want to add a query parameter to it if a value is null. One way of handling this is to start with an empty record and then add fields to it using the Record.AddField function, but using an empty list provides another approach. For example:

//Generates the URL http://jsonplaceholder.typicode.com/comments
//and *not* http://jsonplaceholder.typicode.com/comments?postId=
Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId={}]])

…generates a URL without the postId query parameter in it. This means you can write a simple function with an optional parameter for postId like this:

(optional myPostId as text) =>
Json.Document(
    Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId=myPostId ?? {}]])
    )

If a text value is passed to myPostId then the postId query parameter is added to the URL; if no value is passed, myPostId is null and the ?? null coalescing operator (see Ben Gribaudo’s post here for more information on this operator) can be used to replace it with an empty list.
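For comparison, here’s what the Record.AddField approach mentioned above might look like – a minimal sketch that only adds the postId field to the Query record when a value has been passed in:

//Alternative approach: build the Query record dynamically, only adding
//the postId field when a value has been passed to the function
(optional myPostId as text) =>
let
    QueryRecord =
        if myPostId = null
        then []
        else Record.AddField([], "postId", myPostId),
    Result = Json.Document(
        Web.Contents(
            "http://jsonplaceholder.typicode.com/comments",
            [Query = QueryRecord]
        )
    )
in
    Result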

One last thing to mention: if you need to see the web service calls generated by Web.Contents when you’re testing in the Power Query Editor, you can use the query diagnostics functionality as I show here.

[I’m extremely grateful to Dave Maiden for letting me know about all this – it answers a question I have been asked several times in the past]

Sample HTTP Endpoint For Querying Azure Analysis Services And Power BI Premium

I’ve always thought it would be useful if we could run queries against AAS or Power BI via a simple REST API and I’ve just come across a really nice sample project created by one of my colleagues at Microsoft, David Browne, that allows you to do just that. You can find it here:

https://github.com/microsoft/azure-analysis-services-http-sample

From the ReadMe:

This sample is a HTTP proxy for XMLA endpoints, intended for use with Power BI Premium or Azure Analysis Services.

It’s implemented as ASP.NET Core 5 API Project. The main API is /api/Query which allows you to POST a DAX query and receive the results as a JSON result.

The sample is coded to pass-through authentication from the client to the XMLA endpoint. So to call the API either use HTTP BASIC auth over HTTPS, passing credentials with the request. It’s highly advised that this be a Service Principal, rather than an AAD user. To specify a Service Principal use a UserName of the form app:[ClientID]@[TenantID], and pass a Client Secret as the Password.

For better security, instead of passing a UserName/Password using HTTP BASIC auth, fetch a Bearer token for your XMLA endpoint. To fetch a token use the Resource ID https://analysis.windows.net/powerbi/api for Power BI, or https://*.asazure.windows.net for Azure Analysis Services
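To give you an idea of how it might be called, here’s a purely hypothetical sketch in M that POSTs a DAX query to the /api/Query endpoint – the proxy URL below is a placeholder, and you should check the sample’s ReadMe for the exact request format and authentication it expects:

let
  //The DAX query to run against the dataset behind the proxy
  DaxQuery = "EVALUATE TOPN(10, 'Date')",
  //POST the query to the /api/Query endpoint; the URL below is a
  //placeholder and authentication (Basic or Bearer, as described in the
  //ReadMe) still needs to be supplied
  Response = Web.Contents(
      "https://myproxy.azurewebsites.net",
      [
        RelativePath = "api/Query",
        Content = Text.ToBinary(DaxQuery),
        Headers = [#"Content-Type" = "text/plain"]
      ]
    ),
  Results = Json.Document(Response)
in
  Results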

I can think of all kinds of cool stuff you could do with this…

Sixteenth Blog Birthday Post: On The Question Of “Why Don’t You Add This One Simple Feature To Power BI?”

Every year, on the anniversary of the first-ever post on this blog, I write a post reflecting on what has happened to me professionally in the past year. While I’m not sure anyone really wants to look back on the events of 2020, there is a topic I’ve been intending to write about for some time that I think has some interesting parallels with the bigger challenges of this year – although I’ll keep my focus on Power BI and let you draw your own conclusions about the rest.

As you may know, I work on the Power BI Customer Advisory Team at Microsoft and a large part of my job involves collecting requirements from the large customers I support. I’m also closely involved with the Power BI community through social media, user groups and my old connections from the MVP programme, and as a result I spend a lot of time talking about the Power BI roadmap and the features we’re working on. One question I get asked all the time is this:

Why don’t you add [insert feature idea here] to Power BI?

It’s sometimes followed up by one or more of the following comments:

It would be so easy for you to do
I can’t believe you haven’t done it already
Power BI is unusable without it
[insert competitor name here] has had this feature for years

…and a real or virtual exasperated sigh.

In many cases I’m able to tell the person asking the question that we are in fact planning to add that particular feature; sometimes I get to tell them that the feature already exists in the product. The rest of the time I make sure I understand the request properly, tell the questioner that we take feedback very seriously (which we do) and then make a note of it so it can be discussed with the relevant program managers. You could say I’m a kind of human https://ideas.powerbi.com/

Why, though, are there so many features missing from Power BI that would be easy to implement/make the product unusable through their absence/are so obvious that all our competitors have them? The short answer is of course that “it’s complicated” but let me try to give you a better explanation based on my own experiences.

You may also know that up until eighteen months ago I worked as an independent consultant and trainer specialising in Power BI and Analysis Services for thirteen years; even before that I had worked with Analysis Services for several years in a variety of jobs. In my pre-Microsoft career I worked on hundreds of Power BI and Analysis Services projects from all over the world and I thought I had seen it all. I was wrong.

Although the job I do on the CAT team today is not so different from my previous job, the customers are much larger and their concerns are very different. Before, I rarely encountered customers who used Power BI Premium; now I work with them all the time. Before, I used to wonder who used features like Bring Your Own Key; now I know why it was so important for us to build it. The point is, of course, that with a product that has as many customers and users as Power BI it’s impossible for any one person’s experiences to be representative of everyone’s. It’s like the parable of the blind men and the elephant.

On the whole I think that on the Power BI team we do a good job of balancing the priorities of all of our different customers but with finite resources we can never make everyone happy. There are so many different “one simple features” that I’m asked about – each of them valid and useful – that it would be impossible to get them all done.

What’s more there are plenty of ‘easy’ features that we’d love to implement but which turn out to be much harder to deliver than you’d think, or which only make sense after something else more complicated has been done, or which have cost or security implications on our side that aren’t immediately obvious. Then there’s all the other work that needs to be done to keep the Power BI Service running, able to handle the ever-increasing load on it, and the work we need to do to interface with other Microsoft products for the benefit of the wider Azure platform. There are many other competing demands on the developers’ time.

So, to go back to where I started, it’s complicated. Problems like feature prioritisation aren’t always as simple as they first seem (see also this post by my colleague Matthew Roche). I don’t make decisions about which features do get built in Power BI, although I am involved in the discussions; I know it’s a cliché to say this but the people I work with who do make these decisions are some of the smartest that I know and they do a better job of it than I ever could. Mistakes are made sometimes of course; I suppose the only way to establish whether the number of good decisions outweighs the number of bad ones is the success or otherwise of Power BI in the marketplace.

One last thing to say: I don’t want you to get the impression that I get irritated or angry when I’m asked the “Why don’t you add this one simple feature?” question. In fact I love being asked it: it shows how passionate our customers are, it stops us being complacent, and this feedback really does all get stored, crunched, analysed (in Power BI) and used to make a better product. Please do keep asking! I only wanted to explain why you don’t always get the answer I’d like to give you when you do ask it.

Inlining KQL Query Fragments Using Value.NativeQuery() In Power Query

If the title wasn’t enough to warn you, this post is only going to be of interest to M ultra-geeks and people using Power BI with Azure Data Explorer – and I know there aren’t many people in either group. However I thought the feature I’m going to show you in this post is so cool I couldn’t resist blogging about it.

Let’s say I have a table in Azure Data Explorer called Product that looks like this:

Because this table is in Azure Data Explorer I can query it with KQL, and the thing I love most about KQL as a language is the way that each query is built up as a series of piped transformations. So, for example, on the table above I could write the following query:

Product
| where Type=='Berry'
| where FruitID > 1

…and get this result:

It seems quite similar to the way M works, right? Anyway, now for the cool part. When you connect Power Query in Power BI up to Azure Data Explorer, not only does query folding take place but you can mix and match native KQL fragments (added using the Value.NativeQuery M function) and Power Query transformations created in M, and it all still folds! This completely blew my mind when I found out about it. It’s documented briefly here but the docs don’t explain everything that you can do with it.

Consider the following M query:

let
  Source
    = AzureDataExplorer.Contents(
        "https://xyz.ukwest.kusto.windows.net/", 
        "MyDatabase", 
        "Product"
      ),
  #"Filtered Rows"
    = Table.SelectRows(
        Source, 
        each ([Type] = "Berry")
      ),
  Q1 = Value.NativeQuery(
      #"Filtered Rows", 
      "| where FruitID>2"
    )
in
  Q1

There are three steps:

  1. Source connects to the Product table
  2. #”Filtered Rows” is a filter generated by the Power Query UI that filters the table down to just the rows where Type is “Berry”
  3. Q1 uses Value.NativeQuery to add a KQL filter onto the output of #”Filtered Rows” so that only the rows where FruitID is greater than 2 are returned

Here’s the KQL this folds to:

Product
| where strcmp(["Type"], ("Berry")) == 0
| where FruitID>2

You can also use Value.NativeQuery more than once. Here’s another M query that returns the same result as above:

let
  Source
    = AzureDataExplorer.Contents(
        "https://xyz.ukwest.kusto.windows.net/", 
        "MyDatabase", 
        "Product"
      ),
  Q3 = Value.NativeQuery(
      Source, 
      "| where Type=='Berry'"
    ),
  Q4 = Value.NativeQuery(
      Q3, 
      "| where FruitID>2"
    )
in
  Q4

This is very useful when you want to build a query quickly using the Power Query Editor but then need to use a KQL feature that isn’t supported by Power Query.
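For example, here’s a sketch (using the same hypothetical cluster and table as above) that appends KQL’s sample operator – which has no direct Power Query equivalent – to a filter created in the Power Query Editor:

let
  Source = AzureDataExplorer.Contents(
      "https://xyz.ukwest.kusto.windows.net/", 
      "MyDatabase", 
      "Product"
    ),
  //Filter created in the Power Query Editor, which folds to KQL
  #"Filtered Rows" = Table.SelectRows(
      Source, 
      each ([Type] = "Berry")
    ),
  //KQL's sample operator returns a random matching row - there's no
  //equivalent Power Query transformation
  RandomRow = Value.NativeQuery(
      #"Filtered Rows", 
      "| sample 1"
    )
in
  RandomRow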

Using Small Multiples In Power BI To Improve Report Performance

While the long-awaited small multiples feature that previewed in the December 2020 release is an obvious boost to Power BI’s data visualisation capabilities, did you know that you can use it to improve report performance too?

Earlier this year I wrote blog posts showing how you can improve report performance by showing the same amount of data in fewer visuals (for example by replacing several cards with a single table) and how the number of visuals on a page affects report performance even if they aren’t displaying any data; several other people have written similar posts too. Small multiples are just another way you can replace several visuals with a single visual that displays the same data.

To illustrate, consider the following report with five separate line chart visuals on it that are identical apart from the fact that there is a different filter set on each one:

Here’s what Performance Analyzer shows when the page is refreshed:

In this case everything is fairly quick, but notice that each DAX query takes 10-12ms and by the time we have reached the “Count Of Sales by Date for Terraced” visual the total render duration has reached 710ms.

Now, here’s the same data in a single line chart visual using small multiples:

It’s the same data and the same charts, but look at what Performance Analyzer shows now:

There are two things to point out:

  1. There is only one DAX query which, at 12ms, performs about the same as each of the five DAX queries in the previous version of the report. In this case, by requesting all the data in a single query rather than five separate queries, Power BI has been able to optimise how it retrieves the data it needs. This doesn’t mean that, from a DAX point of view, the small multiples version of the report is five times faster than the original, because Power BI will have run the five queries in parallel in that version; but in general you will see some improvement in overall performance from this consolidation of queries, and in some cases it can be quite significant.
  2. The sum of the visual display durations for the five separate visuals is basically the same as the visual display duration for the small multiples visual, which makes sense because they display the same data in the same way. However, the total duration of the small multiples visual is 486ms compared to 710ms for the total duration of the Count Of Sales by Date for Terraced visual in the original version, so there has been a definite overall improvement in rendering time. In fact, Performance Analyzer doesn’t really give you an accurate way of measuring the overall time taken to render a report page – a much better technique is the one I blogged about here, and that suggests the overall performance saving from using small multiples is almost 500ms.

In conclusion, then, if you have any groups of visuals on your reports that can be replaced by a single small multiples visual then I recommend that you do so – you may see an improvement in performance as a result. Remember also that it’s still early days for small multiples: as more features are added and more visuals support them, you will have more and more opportunities to consolidate visuals.

Capturing SQL Queries Generated By A Power BI DirectQuery Dataset

If you’re using DirectQuery mode for one or more tables in your Power BI dataset, the chances are that you will want to see the SQL (or whatever query language your DirectQuery data source uses) that is generated by Power BI when your report is run. If you can view the queries that are run in the tooling for the data source itself, for example, using Extended Events or SQL Server Profiler for SQL Server, then great – but you may not have permissions to do this. The good news is that you can capture the SQL queries in Power BI Desktop too, even though it’s not always obvious how to do so.

For some data sources, such as SQL Server, Performance Analyzer will give you the SQL queries generated. All you need to do is go to the View tab in the main Power BI Desktop window, click on the Performance Analyzer button to display the Performance Analyzer pane, click on Start Recording and then Refresh Visuals, find the event corresponding to the visual whose queries you want to view, expand it and then click on the “Copy query” link:

This will copy the DAX query generated by your visual to the clipboard; in the case of SQL Server DirectQuery sources you’ll also get the SQL query generated for that DAX query.

However this method does not work for all DirectQuery data sources; for the others you’ll need to use the Query Diagnostics functionality in the Power Query Editor. You need to open the Power Query Editor window, go to the Tools tab on the ribbon, click on the Start Diagnostics button, go back to the main Power BI window, refresh your visuals (you can use the Refresh visuals button in Performance Analyzer for this again) and then go back to the Power Query Editor and click the Stop Diagnostics button. When you do this several new Power Query queries will appear containing diagnostics data. Go to the one whose name starts with “Diagnostics_Detailed” and somewhere in there – where exactly depends on the data source – you’ll find the queries generated. For example, for a Snowflake data source you’ll see the SQL generated somewhere in the Data Source Query column:

For an Azure Data Explorer DirectQuery data source the KQL query will be in one of the Record values in the Additional Info column:

One thing to watch out for is that you may also see what look like SQL Server TSQL queries, even when you’re not using a data source that can be queried with TSQL. Here’s an example from the Azure Data Explorer example above:

You can ignore these queries: they’re not useful, although they do give you an interesting insight into how DirectQuery mode works behind the scenes.
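One last tip: if the diagnostics output contains a lot of rows, you can filter the detailed diagnostics query down to just the rows that contain a data source query. Here’s a minimal sketch – the name of the diagnostics query below is just an example, yours will have a different suffix:

let
  //Reference the detailed diagnostics query created by Query Diagnostics;
  //the name below is just an example
  Source = #"Diagnostics_Detailed_2021_01_01_00_00",
  //Keep only the rows where a query was sent to the data source
  DataSourceQueries = Table.SelectRows(
      Source,
      each [Data Source Query] <> null and [Data Source Query] <> ""
    )
in
  DataSourceQueries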

Using Power BI Dynamic M Parameters In DAX Queries

As I’m sure you’ve guessed by now I’m a big fan of dynamic M parameters in Power BI. They’re easy to use in Power BI Desktop but what if you want to use them in your own DAX queries? Documentation for this is coming soon, but in the meantime I thought it would be useful to show the additions to DAX query syntax to support them – something you can see for yourself if you take a look at the DAX queries generated by Power BI Desktop using Performance Analyzer.

Here’s an example of a query generated by Power BI Desktop where there are three dynamic M parameters defined: DateParameter, TextParameter and NumericParameter.

DEFINE
  MPARAMETER DateParameter = 
    DATE(2020, 1, 1)

  MPARAMETER TextParameter = 
    "January"

  MPARAMETER NumericParameter = 
    1

  VAR __DS0FilterTable = 
    TREATAS({DATE(2020, 1, 1)}, 'ParamValues'[DateP])

  VAR __DS0FilterTable2 = 
    TREATAS({"January"}, 'ParamValues'[MonthNameP])

  VAR __DS0FilterTable3 = 
    TREATAS({1}, 'ParamValues'[MonthNoP])

  VAR __DS0Core = 
    CALCULATETABLE(
      DISTINCT('DATE'[DATE]),
      KEEPFILTERS(__DS0FilterTable),
      KEEPFILTERS(__DS0FilterTable2),
      KEEPFILTERS(__DS0FilterTable3)
    )

  VAR __DS0PrimaryWindowed = 
    TOPN(501, __DS0Core, 'DATE'[DATE], 1)

EVALUATE
  __DS0PrimaryWindowed

ORDER BY
  'DATE'[DATE]

The dynamic M parameters are set using a DEFINE statement and the new MPARAMETER keyword. The name of the parameter here is the same as the name of the parameter defined in the Power Query Editor; one thing to point out is that if your M parameter name contains a space, it should be surrounded by single quotes.

If you have enabled multi-select on your dynamic M parameter, you will need to pass a table of values to it in your DAX query. Here’s an example of how to do this with a table constructor:

DEFINE
  MPARAMETER 'MonthParameter' = 
    {"January",
      "February"}

I’ll admit I haven’t tried this yet, but you should be able to do some really cool stuff with this in a paginated report connected to a DirectQuery dataset if it contains dynamic M parameters. Maybe in a future post…

Session Recording: “Advanced Analytics Features In Power BI”

Recently I had the pleasure of presenting for the team at Planilheiros (whose Portuguese-language YouTube channel is almost as popular as Guy In A Cube) at their Power BI Summit. My session was on “Advanced Analytics Features in Power BI” and you can watch it here:

It’s on all the built-in analytics features in Power BI that you can use to make sense of your data – without using any DAX or M. I look at things like adding forecasts to line charts, adding ratio lines and symmetry lines to scatter charts, “explain the increase/decrease” on column charts, the Key Influencers and Decomposition Tree visuals and lots more.
