How The “Maximum Connections Per Data Source” Property On Power BI DirectQuery Datasets Can Affect Report Performance

If you’re working with DirectQuery in Power BI then one of the most important properties you can set on your dataset is the “Maximum connections per data source” property. You can find it on the Published Dataset Settings tab in the Options dialog in Power BI Desktop:

The description of what it does in the guidance documentation is pretty comprehensive:

You can set the maximum number of connections DirectQuery opens for each underlying data source. It controls the number of queries concurrently sent to the data source.

The setting is only enabled when there’s at least one DirectQuery source in the model. The value applies to all DirectQuery sources, and to any new DirectQuery sources added to the model.

Increasing the Maximum Connections per Data Source value ensures more queries (up to the maximum number specified) can be sent to the underlying data source, which is useful when numerous visuals are on a single page, or many users access a report at the same time. Once the maximum number of connections is reached, further queries are queued until a connection becomes available. Increasing this limit does result in more load on the underlying data source, so the setting isn’t guaranteed to improve overall performance.

When the model is published to Power BI, the maximum number of concurrent queries sent to the underlying data source also depends on the environment. Different environments (such as Power BI, Power BI Premium, or Power BI Report Server) each can impose different throughput constraints.

I thought it would be interesting to do some experiments to see how this property behaves, what you see in Profiler (or Log Analytics) when connections are queued up, and how you can find an optimal value for your dataset.

The first thing to mention – and this is something I only realised relatively recently – is that this property applies to DirectQuery on Power BI datasets and Analysis Services as well as traditional DirectQuery to external databases. I’m a lot more comfortable with Power BI than any relational database so I decided to do my testing with a DirectQuery dataset connected back to another Power BI dataset; the behaviour of the feature is the same as with DirectQuery to a relational database.

For my tests I created a simple dataset – let’s call it Dataset A – with not much data but a really inefficient DAX measure on it. I then created a composite model dataset – let’s call this Dataset B – with a DirectQuery connection to Dataset A. Finally I created a report with a Live connection to Dataset B with 25 card visuals on, each of which used the inefficient measure with a different filter. The DAX query for each of these cards, when run on its own through DAX Studio, took around 28 seconds, with almost all that time spent in the Formula Engine. The datasets and reports were published to a PPU workspace and all tests were run in the Power BI Service and not in Power BI Desktop (at the time of writing, things work differently in Desktop – which means you should always test performance of DirectQuery reports in the Service and not in Desktop). I ran Profiler traces using the Query Begin and Query End events on both Dataset A and Dataset B during my tests.

First of all, let’s see what happened when the Maximum Number of Connections property on Dataset B was set to 1. This means that Dataset B is only allowed to have one connection open to Dataset A to run its DirectQuery queries. When the report was run, right at the start the Profiler trace on Dataset B showed 25 Query Begin events indicating all 25 queries for the 25 card visuals were being run in parallel; the Profiler trace on Dataset A showed just 1 Query Begin event:

This is what you would expect: since Dataset B can only use one connection to Dataset A it can only run one query at a time and the other 24 queries have to queue up to wait for the connection. When the first query against Dataset A completed, another one started and so on. Since the maximum length of time that a DAX query can run in the Power BI Service is 225 seconds, after 225 seconds any remaining queries timed out. At that point 8 queries had completed (which makes sense: eight queries at roughly 28 seconds each take around 224 seconds), so 8 cards were rendered, and all the remaining cards showed the timeout error I blogged about here:

At the end, the Profiler trace against Dataset A showed 8 completed Query Begin/End pairs:

The Profiler trace against Dataset B, meanwhile, showed the 25 Query Begin events followed by 8 Query End events for the successful queries and then 17 Query End events with timeout errors for the unsuccessful queries.

One interesting thing to notice is the durations of the queries. As I said, when run on their own each of these queries took around 28 seconds, and the Profiler trace on Dataset A shows each query taking around 28 seconds. If you look at the successful queries on Dataset B you’ll see that their duration goes up in increments of around 28 seconds: the first takes 29984ms, the second takes 58191ms, the third takes 87236ms and so on until you hit the 225 second timeout limit. This shows that the duration of the queries against Dataset B, the composite model, includes the time spent waiting to acquire a connection. Notice also that the CPU Time of the queries against Dataset B is minimal because it only includes the CPU used by the query for Dataset B; you have to add the CPU Time of the related queries on Dataset A to get the total CPU Time used by these queries.

The important question is, though, what is the effect of increasing the Maximum Connections Per Data Source property? Increasing it will increase the number of queries run in parallel, but is more parallelism always better? I reran my tests with the property set to 5, 10 (which is the default value and the maximum that can be used for datasets not in Premium capacity) and 30 (which is the maximum value that can be used for datasets in any form of Premium capacity). Here are the results:

Maximum Connections Per Data Source     Number of Visuals That Render Successfully In 225 Seconds
1                                       8
5                                       16
10                                      10
30                                      8

As you can see, increasing the parallelism a little helps more than increasing it a lot, and in this case reducing the value from the default was better than increasing it: overloading your source with a lot of expensive parallel queries is often a bad thing. This test isn’t representative of most real-world reports – you shouldn’t have one visual, let alone 30, with queries that run for as long as 30 seconds, and the best way to optimise a report like this would be to display the same data in a smaller number of visuals – but I think it’s a useful illustration of how this property works and how it can affect report performance.

Why Not All SQL Queries Can Be Used As A Source In Power BI DirectQuery Mode

A few days ago I was tagged on a thread on Twitter: my old pal Mim was upset to find that he couldn’t use a SQL query with a Common Table Expression as the source of a table in DirectQuery mode. He worked out why without my help but I thought that it was worth me writing up an explanation for other people who might be struggling with the same problem.

In Power BI DirectQuery mode you can – for most data sources – use your own SQL query as the source for a table in your dataset. For example when connecting to the AdventureWorksDW 2017 sample database in SQL Server using DirectQuery mode, I can use the following SQL query as the source of a table:

SELECT 
FullDateAlternateKey, 
EnglishDayNameOfWeek, 
EnglishMonthName
FROM DimDate

Now, let’s say I use this table in a dataset and build the following report with two slicers and a table, and with a month and a day name selected in the slicers:

The SQL generated by Power BI to populate the table looks like this:

SELECT 
TOP (501) [t0].[FullDateAlternateKey]
FROM 
(
(
SELECT 
FullDateAlternateKey, 
EnglishDayNameOfWeek, 
EnglishMonthName
FROM DimDate
)
)
 AS [t0]
WHERE 
(
(
[t0].[EnglishDayNameOfWeek] = N'Sunday'
)
 AND 
(
[t0].[EnglishMonthName] = N'April'
)
)

GROUP BY [t0].[FullDateAlternateKey],[t0].[EnglishDayNameOfWeek],[t0].[EnglishMonthName]
ORDER BY [t0].[EnglishDayNameOfWeek] ASC,
[t0].[EnglishMonthName] ASC,
[t0].[FullDateAlternateKey] ASC

Notice how the original SQL query I used as the source of the table is used as a subquery in this query. This is how Power BI generates its SQL and this is why there are limitations on the SQL you can use in DirectQuery mode: not all SQL queries that run standalone can be used in subqueries.

Take the following variation on the SQL above for example:

SELECT
FullDateAlternateKey,
EnglishDayNameOfWeek,
EnglishMonthName
FROM DimDate
ORDER BY
EnglishMonthName

It’s the same query as before but with an ORDER BY clause. It works fine in the Power Query Editor but when you go to your report you’ll see the following error:

“Microsoft SQL: The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries and common table expressions unless TOP, OFFSET or FOR XML is also specified”

This is the error message generated by the SQL query that Power BI is trying to run – and as you will have guessed by now, it’s being generated because Power BI is trying to use your query with an ORDER BY clause inside its own subquery. You can’t put an ORDER BY clause in a subquery in TSQL except in the scenarios mentioned in the error message. The same goes for Common Table Expressions.
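To make the error message concrete, here’s a minimal, hand-written TSQL sketch against the same DimDate table (this is not SQL generated by Power BI). The first query fails with the error above because the derived table contains an ORDER BY on its own; the second compiles because a TOP is specified, but even then SQL Server gives no guarantee about the ordering of the outer query’s results, so the ORDER BY buys you nothing:

-- Fails: ORDER BY is not allowed in a derived table on its own
SELECT [t0].[FullDateAlternateKey]
FROM (
    SELECT FullDateAlternateKey, EnglishDayNameOfWeek, EnglishMonthName
    FROM DimDate
    ORDER BY EnglishMonthName
) AS [t0];

-- Compiles: adding TOP (100) PERCENT makes the ORDER BY legal, but the
-- ordering of the outer query is still not guaranteed
SELECT [t0].[FullDateAlternateKey]
FROM (
    SELECT TOP (100) PERCENT FullDateAlternateKey, EnglishDayNameOfWeek, EnglishMonthName
    FROM DimDate
    ORDER BY EnglishMonthName
) AS [t0];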

What can you do to work around this? There is really just one answer: model your data the way Power BI likes it (ie as a star schema) before it gets to Power BI. It’s the now-famous Roche’s Maxim. Even if you aren’t running into this limitation, using SQL queries as the source of a table is a bad idea because it makes maintenance difficult and you end up doing any expensive transformations at query time, paying the performance penalty over and over again. And yes, I know, some of you don’t have permissions to create views or get the data modelled correctly in the source and I know it’s tough, but that’s the way it is.
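If you do have the necessary permissions, a view is the simplest way to replace a native SQL query as the source of a DirectQuery table. Here’s a minimal sketch based on the query used earlier; the view name vwDimDate is made up for illustration:

-- Hypothetical view wrapping the same column list as the native query above
CREATE VIEW dbo.vwDimDate
AS
SELECT
    FullDateAlternateKey,
    EnglishDayNameOfWeek,
    EnglishMonthName
FROM dbo.DimDate;

The DirectQuery table can then be pointed at dbo.vwDimDate from the Navigator pane, and any expensive transformation logic can be pushed further upstream so it isn’t paid for at query time.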

TSQL Table-Valued Functions And Dynamic M Parameters In Power BI DirectQuery

My favourite – and it seems many other people’s favourite – new feature in the February 2022 Power BI Desktop release is support for more datasources (including SQL Server, Azure SQL DB and Synapse) with dynamic M parameters. In my opinion dynamic M parameters are extremely important for anyone planning to use DirectQuery: they give you a lot more control over the SQL that is generated by Power BI and therefore give you a lot more control over query performance.

Teo Lachev has already stolen my thunder and blogged about how the new functionality allows you to use a TSQL stored procedure as the source of a table in DirectQuery mode. In this post I’m going to show you something very similar – but instead of using a stored procedure, I’m going to show a simple example of how to use a TSQL table-valued function, which I think has a slight advantage in terms of ease-of-use.

To start off I created a table-valued function in the AdventureWorksDW2017 sample database on my local SQL Server which returns some filtered data from the DimDate table:

CREATE FUNCTION [dbo].[udfDates] (
    @day_name varchar(50),
    @month_name varchar(50)
)
RETURNS TABLE
AS
RETURN
    SELECT
        FullDateAlternateKey, EnglishDayNameOfWeek, EnglishMonthName, CalendarYear
    FROM
        DimDate
    WHERE
        EnglishDayNameOfWeek = @day_name
        AND EnglishMonthName = @month_name;

Here’s how it can be called in a SQL SELECT statement:

select
    FullDateAlternateKey,
    EnglishDayNameOfWeek,
    EnglishMonthName,
    CalendarYear
from
    udfDates('Thursday', 'February')
where
    CalendarYear = 2005

Next, I created a Power Query query in my dataset that called this function and returned a table in DirectQuery mode. The great thing about table-valued functions is that they appear in the Navigator pane when you connect to a SQL Server database from the Power Query Editor:

You can only hard-code the values you pass to the function’s parameters at this point, but even if you do nothing here and just return an empty table that’s fine. After doing this I selected DirectQuery mode:

After this I created two Power Query text parameters, called DayName and MonthName, to hold the day and month names to be passed to the function:

I also created two Import mode queries called DayNames and MonthNames to hold all the valid values for the DayName and MonthName parameters:

The last thing to do in the Power Query Editor was to edit the query calling the function to pass the M parameters to it. Here’s the M code for the query after the modification made to the “Invoked Functiondbo_udfDates1” step:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017"
  ), 
  dbo_udfDates = Source
    {
      [
        Schema = "dbo", 
        Item   = "udfDates"
      ]
    }
    [Data], 
  #"Invoked Functiondbo_udfDates1"
    = dbo_udfDates(DayName, MonthName)
in
  #"Invoked Functiondbo_udfDates1"

At this point I closed the Power Query Editor and loaded the three tables to my dataset:

Next I bound the EnglishDayNameOfWeek column on the DayNames table to the DayName M parameter and the EnglishMonthName column on the MonthNames table to the MonthName M parameter:

Finally, I built a report with two slicers bound to the two dynamic M parameter columns and a table showing the output of the table-valued function:

Here’s an example of the TSQL generated by Power BI to populate the table in this report:


SELECT 
TOP (501) 
[t0].[FullDateAlternateKey],
[t0].[EnglishDayNameOfWeek],
[t0].[EnglishMonthName]
FROM 
(
(
select [$Table].[FullDateAlternateKey],
    [$Table].[EnglishDayNameOfWeek],
    [$Table].[EnglishMonthName],
    [$Table].[CalendarYear]
from [dbo].[udfDates]('Thursday', 'February') as [$Table]
)
)
 AS [t0]
WHERE 
(
[t0].[CalendarYear] = 2010
)

GROUP BY 
[t0].[FullDateAlternateKey],[t0].[EnglishDayNameOfWeek],
[t0].[EnglishMonthName],[t0].[CalendarYear]
ORDER BY [t0].[FullDateAlternateKey] ASC,
[t0].[EnglishMonthName] ASC,
[t0].[CalendarYear] ASC,
[t0].[EnglishDayNameOfWeek] ASC
 

I am by no means an expert in writing efficient TSQL so I can’t comment on the pros and cons of table-valued functions, stored procedures or using native SQL queries in Power BI (although the last of these has obvious maintainability issues). Hopefully, though, you can see the possibilities – and if you do get round to using this approach on a project, please let me know how you get on!

Power BI DirectQuery Query Labels For SQL Server-Related Sources

If you’re using Power BI in DirectQuery mode against a SQL Server-related data source (ie SQL Server on-prem, Azure SQL DB or Synapse) you may have noticed a new feature that was released a few weeks ago: query labels that allow you to link a SQL query to the Power BI visual that generated it.

There’s nothing you need to do to enable it – it happens automatically. Here’s an example of a TSQL query generated by a DirectQuery mode dataset with a query label in it:

You can see that an OPTION clause in the query adds a label that contains three GUIDs that identify the dataset, report and visual that the query was generated for. These are the same GUIDs that you’ll see used for this purpose in other places, for example Log Analytics. This post has more details on how you can work out which datasets, reports and visuals these GUIDs relate to.
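Purely as an illustration (the GUIDs and the exact label format below are placeholders, not real output), a labelled query against a Synapse dedicated SQL pool, where OPTION (LABEL = ...) is supported, would end something like this:

SELECT TOP (501) [t0].[FullDateAlternateKey]
FROM [dbo].[DimDate] AS [t0]
GROUP BY [t0].[FullDateAlternateKey]
-- The three placeholder GUIDs stand for the dataset, report and visual ids
OPTION (LABEL = 'DatasetId=00000000-0000-0000-0000-000000000001;ReportId=00000000-0000-0000-0000-000000000002;VisualId=00000000-0000-0000-0000-000000000003');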

As I said, this only works for SQL Server-related sources at the moment, but if you think this would be useful for other sources (and I’m sure you do…) please leave a comment below!

Dynamic M Parameters, Snowflake Native SQL And Paginated Reports

There were a couple of new features and enhancements to existing features in the June 2021 Power BI Desktop release that don’t seem to have much to do with each other but which I think can be combined to do cool things. They are:

  1. The new paginated report visual
  2. Native SQL support in the Snowflake connector
  3. Improvements to dynamic M parameters

Let me give you an example of what I mean…

First of all, let’s start with native SQL support in the Snowflake connector. I deal with a lot of customers who use Snowflake and Power BI together and I know just how much people have wanted this. What does it allow you to do? Well, you have always been able to use the Power Query Editor to transform data coming from Snowflake in either Import mode or DirectQuery mode. Now, though, you can write your own native SQL query and use it as the source for a Power Query query (something that has always been possible with some other connectors, such as the SQL Server connector). Incidentally, this also means that the EnableFolding=true option for Value.NativeQuery that I blogged about recently now works for Snowflake too.

The main reason you’d want to use a native SQL query when connecting to Snowflake, or indeed any database, is to do something that’s possible in SQL but not in Power Query. One example of this is using regular expressions to filter data. I have the AdventureWorks DW DimCustomer table loaded into Snowflake and I can use Snowflake’s REGEXP function to filter on the LASTNAME column with something like this:

SELECT 
DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION 
FROM "AWORKS"."PUBLIC"."DIMCUSTOMER" 
WHERE LASTNAME REGEXP 'To.*'

So that’s useful. I can use a query like this as the source of a table in DirectQuery mode in Power BI, but wouldn’t it be useful if end users of my report could change the regular expression used to filter the data? This is where dynamic M parameters come in. Assuming I have a table of pre-defined regular expressions:

And an M parameter:

…I can write an M query like this that uses the M parameter to return the regular expression used in the WHERE clause of the SQL query:

let
  Source = Value.NativeQuery(
    Snowflake.Databases(
      "mysnowflake.com", 
      "DEMO_WH"
    ){[Name = "AWORKS"]}[Data], 
    "SELECT DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION 
    FROM ""AWORKS"".""PUBLIC"".""DIMCUSTOMER"" 
    WHERE LASTNAME REGEXP '"
      & pRegEx
      & "'", 
    null, 
    [EnableFolding = true]
  )
in
  Source

…and then turn this into a dynamic M parameter in the Power BI diagram pane:

…and get a report that does this:

One limitation of dynamic M parameters in regular Power BI reports today is that the values you pass into them have to come from a column somewhere inside your dataset, so all of these values have to be pre-defined. Wouldn’t it be useful if the end user could enter any regular expression they wanted, though? That may not be possible in a regular Power BI report but it is possible with a paginated report: with paginated reports you can write whatever DAX query you want, and therefore pass any value you want to a dynamic M parameter, and you also have the option of creating parameters where the user can enter whatever value they want.

I blogged about how to write DAX queries that contain dynamic M parameters here. Here’s an example of a parameterised DAX query (yes, I know, so many types of parameters…) that takes a regular expression and the name of an occupation and returns a table of customers whose last names match the regular expression and whose occupations match the one entered:

DEFINE
    MPARAMETER pRegEx = @DAXRegExParam
EVALUATE
FILTER (
    Customers,
    'Customers'[ENGLISHOCCUPATION] = @DAXOccupationParam
)

This can be used in a paginated report dataset connected to the Power BI dataset created above (yes, I know, so many types of datasets…) like so:

….which can then be used to build a paginated report that does this:

And of course, with the new paginated report visual, this paginated report can be embedded in a regular Power BI report:

All this is very much a proof-of-concept and not something I would recommend for production (I would be worried about SQL injection attacks for a start). There are more enhancements to these features still to come too. However, I do think it’s interesting to see how these features can be put together now and to imagine how they could be used in the future. What do you think?

Measuring DirectQuery Performance In Power BI

If you have a slow DirectQuery report in Power BI one of the first questions you need to ask is how long the SQL queries that Power BI generates take to run. This is a more complicated question to answer than you might think, though, and in this post I’ll explain why.

I happen to have access to some of the famous New York taxi data in a Snowflake database, and in there is a table with trip data that has 173 million rows that I have built a Power BI dataset from. The data and the database used are not really important here though – what is important is that it’s DirectQuery and a large-ish amount of data. Here’s a report page with a single table visual on it, showing passenger count aggregated by the hack license field:

It’s slow, but how slow? Here’s what Performance Analyzer shows when I refresh the table:

The DAX query takes 5.4 seconds but the Direct Query time is only 3.3 seconds – and the numbers don’t seem to add up. Here’s what Profiler captures for the same refresh shown in Performance Analyzer:

This shows there’s a gap of 2 seconds between the DirectQuery End event and the Query End event. What if I paste the DAX query into DAX Studio? Here’s what the Server Timings tab shows:

This is a different query execution to the two examples above, both of which show data for the same execution, which explains why the numbers are slightly different here – but again there seems to be an extra second of stuff happening and DAX Studio suggests that it’s in the Formula Engine.

So what is going on? The answer lies in understanding what the DirectQuery End Profiler event actually measures: it’s the amount of time between the Analysis Services engine handing a query over to the Power Query engine and the Analysis Services engine receiving the first row in the resultset back, including the time taken for the Power Query engine to fold the query.

Therefore if it takes a long time to get all the rows in the resultset then that could explain what’s going on here. Unfortunately there’s no way of knowing from Profiler events how long this takes – but there is another way. Going back to Performance Analyzer, if you export the data from it to JSON (by clicking the Export button) and load it into Power Query, you can see more detail about a DirectQuery query execution. Here’s the data from the first execution above:

[There’s a very good paper documenting what’s in the Performance Analyzer JSON file here]

Looking at the record in the metrics column for the Execute Direct Query event you can see the same 3.2 second duration shown above in Profiler. Notice that there are two other metrics here as well: RowsRead, which is the total number of rows returned in the resultset; and DataReadDuration, which is the amount of time taken to read these rows after the first row has been received, plus the time taken for some other Analysis Services engine operations such as encoding column values, joining with unpushed semijoins, projecting aggregations such as Average, and saving the resultset to the in-memory cache. In this case the SQL query has returned 43191 rows and this takes 1.95 seconds – which explains the gap between the end of the Execute Direct Query event and the end of the query.

One last question: why is this SQL query returning so many rows when the DAX query is only asking for the top 502 rows?

The reason is that, at the time of writing at least, the Analysis Services engine can only push a top(n) operation down to a DirectQuery SQL query in very simple scenarios where there are no measures and no aggregation involved – and in this case we’re summing up values. As a result, if you’re using DirectQuery mode and have a visual like this that can potentially display a large number of rows and includes a measure or aggregated values, you may end up with slow performance.
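To illustrate the difference, here are two simplified, hand-written queries (the table and column names are made up, and the real source in this example is Snowflake, but the principle is the same for any DirectQuery source). In simple scenarios with no measures the row limit can appear in the SQL itself, like the TOP (501) queries shown earlier in this post; once an aggregated value such as a sum of passenger counts is involved, every group is returned and the row limit is applied afterwards by the Analysis Services engine:

-- Simple scenario, no measure: the row limit can be pushed into the SQL
SELECT TOP (501) [t0].[hack_license]
FROM [dbo].[trips] AS [t0]
GROUP BY [t0].[hack_license];

-- With an aggregated measure the TOP cannot be pushed down: all the groups
-- come back (43191 rows in this example) and the top(n) is applied later
-- by the Analysis Services engine
SELECT [t0].[hack_license],
       SUM([t0].[passenger_count]) AS [a0]
FROM [dbo].[trips] AS [t0]
GROUP BY [t0].[hack_license];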

[Thanks to Jeffrey Wang for providing the information in this post]

Capturing SQL Queries Generated By A Power BI DirectQuery Dataset

If you’re using DirectQuery mode for one or more tables in your Power BI dataset, the chances are that you will want to see the SQL (or whatever query language your DirectQuery data source uses) that is generated by Power BI when your report is run. If you can view the queries that are run in the tooling for the data source itself, for example, using Extended Events or SQL Server Profiler for SQL Server, then great – but you may not have permissions to do this. The good news is that you can capture the SQL queries in Power BI Desktop too, even though it’s not always obvious how to do so.

For some data sources, such as SQL Server, Performance Analyzer will give you the SQL queries generated. All you need to do is go to the View tab in the main Power BI Desktop window, click on the Performance Analyzer button to display the Performance Analyzer pane, click on Start Recording and then Refresh Visuals, find the event corresponding to the visual whose queries you want to view, expand it and then click on the “Copy query” link:

This will copy the DAX query generated by your visual to the clipboard; in the case of SQL Server DirectQuery sources you’ll also get the SQL query generated for that DAX query.

However, this method does not work for all DirectQuery data sources; for these sources you’ll need to use the Query Diagnostics functionality in the Power Query Editor. You need to open the Power Query Editor window, go to the Tools tab on the ribbon, click on the Start Diagnostics button, go back to the main Power BI window, refresh your visuals (you can use the Refresh visuals button in Performance Analyzer for this again) and then go back to the Power Query Editor and click the Stop Diagnostics button. When you do this several new Power Query queries will appear containing diagnostics data. Go to the one whose name starts with “Diagnostics_Detailed” and somewhere in there – where exactly depends on the data source – you’ll find the query generated. For example, for a Snowflake data source you’ll see the SQL generated somewhere in the Data Source Query column:

For an Azure Data Explorer DirectQuery data source the KQL query will be in one of the Record values in the Additional Info column:

One thing to watch out for is that you may also see what look like SQL Server TSQL queries, even when you’re not using a data source that can be queried with TSQL. Here’s an example from the Azure Data Explorer example above:

You can ignore these queries: they’re not useful, although they do give you an interesting insight into how DirectQuery mode works behind the scenes.
