Query Folding And Staging In Fabric Dataflows Gen2

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However, one thing I never got round to until this week is looking at how you can tell whether query folding is happening on staged data inside a Dataflow – which turns out to be harder than you might think.

Consider the following simple Dataflow consisting of two queries:

The first query, called StageData, reads data from a 6GB, 17 million row CSV file containing open data from the English Prescribing Data dataset. It returns two columns called PRACTICE_NAME and TOTAL_QUANTITY from that CSV file:

Staging is enabled on this query:

The second query, called GroupBy, takes the data returned by the StageData query and does a Group By operation to get the sum of the values in the TOTAL_QUANTITY column for each distinct value in PRACTICE_NAME:

The output of this query was loaded to a Fabric Warehouse:

The scenario is basically the same as the one from my previous post but with a much larger data volume, and the idea was to test again whether it was faster to stage the data and do the Group By on the staged data or to not stage the data and do the Group By while reading the data direct from the source.

It turned out that, once again, staging made performance worse (don’t worry, I have other tests that show it can help performance). But the point of staging is that by loading the output of a query into a hidden Fabric Lakehouse managed by the Dataflow (which is what is meant by “staging”), any subsequent operations on this data can be faster because query folding can take place against the SQL Endpoint of this hidden Lakehouse – and at the time of writing this post there’s no way of knowing from the Dataflows Editor whether query folding is taking place. Right-clicking on the step that does the Group By operation shows that the “View data source query” option is greyed out, but this only tells you that the Editor doesn’t know if folding is taking place:

In fact other things in the UI, such as the query plan and the query folding indicators, suggest incorrectly that folding is not taking place:

So I thought: if query folding is taking place then the Group By will result in a SQL query run against the hidden Lakehouse, so maybe I can see this SQL query somewhere? Unfortunately since the Lakehouse is hidden you can’t get to it through the Fabric web interface. But then I remembered that you can connect to a Fabric workspace using good old SQL Server Management Studio (instructions on how to do this can be found here). And when I connected using SSMS I could see two hidden objects created by my Dataflow called StagingLakehouseForDataflows and StagingWarehouseForDataflows:
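If you want to look for these objects yourself, a query along the following lines run from SSMS should list them. This is a minimal sketch: I’m assuming that, as with other Fabric items, the staging Lakehouse and Warehouse show up as databases in sys.databases.

-- List the items visible to this connection; the hidden staging Lakehouse
-- and Warehouse created by Dataflows should appear here with names like
-- StagingLakehouseForDataflows_... and StagingWarehouseForDataflows_...
SELECT name, create_date
FROM sys.databases
WHERE name LIKE '%ForDataflows%';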

I was then able to run a SQL query using the queryinsights.exec_requests_history DMV against StagingWarehouseForDataflows, filtered for the time range when my Dataflow refresh was taking place:

SELECT start_time, statement_type, command, total_elapsed_time_ms
FROM queryinsights.exec_requests_history
WHERE start_time > 'insert DF refresh start time here'
AND end_time < 'insert DF refresh end time here'
ORDER BY start_time DESC

…and saw the following INSERT INTO statement, which did the Group By operation I was expecting to see, along with how long it took:

insert into
[StagingWarehouseForDataflows_20260223144925].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_2930fd3c_002D2a62_002D4518_002Dafbf_002D249e7af54403]
([Column1], [Column2])
select [_].[Column1] as [Column1],
convert(float, [_].[Total Quantity]) as [Column2]
from (
select [rows].[Column1] as [Column1],
sum([rows].[Column2]) as [Total Quantity]
from [StagingLakehouseForDataflows_20260223144911].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_179186be_002D367d_002D4924_002Da8ba_002Dd1f220415e3a]
as [rows]
group by [Column1] )
as [_]

So, a useful tip if you’re performance tuning a Dataflow even if it’s a bit of a pain to do. Hopefully in the future we’ll be able to see the SQL generated when query folding takes place against a staged table.

[Thanks to Miguel Escobar for his help with this]

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment, it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

For my tests I created two queries within a single Dataflow Gen2 (CI/CD). First, an M function called SlowFunction that takes a numeric value and returns that value with 1 added to it after a two-second delay:

// Returns input+1 after a two-second delay
(input as number) as number =>
    Function.InvokeAfter(() => input + 1, #duration(0, 0, 0, 2))

Then the main query which returns a table of ten rows and calls the SlowFunction M function once per row:

let
    // A list of ten single-item lists: {1}, {2}, ..., {10}
    Rows = List.Transform({1 .. 10}, each {_}),
    // Turn that list into a ten-row, one-column table
    MyTable = #table(type table [RowNumber = number], Rows),
    // Call SlowFunction once per row and type the new column as a whole number
    #"Added custom" = Table.TransformColumnTypes(
        Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
        {{"FunctionOutput", Int64.Type}}
    )
in
    #"Added custom"

Here’s what the output of the query looks like:

Now, the first important question: how long does this Dataflow take to run? Ten rows calling a function that takes 2 seconds to run, so 10*2=20 seconds maybe? The answer is yes if you look at how long the preview takes to populate in the Dataflow Editor:

That’s just the preview though. When you refresh a Dataflow there are other things that happen that affect how long the refresh takes, such as staging and loading the data to a destination. There’s no way to separate the performance of your M code from these operations when looking at the duration of a Dataflow refresh in Recent Runs, which explains why some of the timings you will see later in this post seem strange. Don’t worry, though, it doesn’t stop you from seeing the important trends. I’m told that setting a CSV file in a Lakehouse as your data destination is the best way of minimising the impact of loading data on overall refresh durations, but at the time of writing the CSV destination can’t be used with Partitioned Compute, so all my tests used a Fabric Warehouse as a destination.

Here’s what Recent Runs showed when this Dataflow was refreshed:

The overall refresh time was 59 seconds; the query (called NonPartitioned here) that returns the table of ten rows and which was staged took 29 seconds.

Could this Dataflow benefit from Partitioned Compute? With Partitioned Compute enabled in the Options dialog, I added the necessary M code to the query:

let
    Rows = List.Transform({1 .. 10}, each {_}),
    MyTable = #table(type table [RowNumber = number], Rows),
    #"Added custom" = Table.TransformColumnTypes(
        Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
        {{"FunctionOutput", Int64.Type}}
    ),
    // Setting a partition key is the single line that enables Partitioned Compute
    ReplacePartitionKey = Table.ReplacePartitionKey(#"Added custom", {"RowNumber"})
in
    ReplacePartitionKey

…and then refreshed. Here’s what Recent Runs showed:

The overall refresh duration went up to 1 minute 25 seconds; the query that does all the work (called Partitioned in this case) took 40 seconds. Note that the screenshot immediately above shows the engine used is “PartitionedCompute” and that there are now ten activities listed instead of one: my M code used the RowNumber column in the table as the partition key, so the Dataflow attempted to run each row of the table as a separate operation in parallel. And as you can see, this made performance worse: using Partitioned Compute introduces yet another overhead, and in this case that overhead is much greater than any benefit gained from parallelism.

So I wondered: what if the delay in the query is increased from 2 seconds to 100 seconds? Does this longer delay mean that parallelism results in faster overall performance?

Here’s what Recent Runs showed for a version of my Dataflow with a 100 second delay for each row and which didn’t use Partitioned Compute:

10 rows * 100 seconds = 1000 seconds = 16 minutes 40 seconds, so it’s not surprising that the overall duration of this version of the Dataflow was slow at 17 minutes 29 seconds.

Here’s what Recent Runs shows for the version of this Dataflow that did use Partitioned Compute:

The overall duration was 4 minutes 41 seconds and the main query took 3 minutes 14 seconds. The important takeaway is that this is a lot faster than the version that didn’t use Partitioned Compute, so clearly Partitioned Compute made a big difference to performance here. As you might expect, it looks like parallelising operations that only take a few seconds results in worse performance while parallelising operations that take longer, say a minute or more, is probably a good idea. As always, you’ll need to test to see what benefits you get for your Dataflows.

These results raise a lot of questions too. 100 seconds = 1 minute 40 seconds, which is a lot less than 3 minutes 14 seconds. Does this mean that not every row in the table was evaluated in parallel? Is partitioning on the RowNumber column counter-productive and would it be better to partition in some other way to try to reduce the amount of attempted parallelism? Is there something else that is limiting the amount of parallelism? While this version of the Dataflow always performs better than the non-partitioned version, performance did vary a lot between refreshes. While these tests show how useful Partitioned Compute can be for slow Dataflows, there’s a lot more research to do and a lot more blog posts to write.
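One experiment worth trying here, to reduce the amount of attempted parallelism, would be to partition on a coarser key. Here’s a sketch – which I haven’t tested – that splits the ten rows into two partitions of five, rather than ten partitions of one, by adding a bucket column and using it as the partition key:

let
    Rows = List.Transform({1 .. 10}, each {_}),
    MyTable = #table(type table [RowNumber = number], Rows),
    #"Added custom" = Table.TransformColumnTypes(
        Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
        {{"FunctionOutput", Int64.Type}}
    ),
    // Bucket column: rows fall into two groups, 0 and 1
    AddBucket = Table.AddColumn(#"Added custom", "Bucket", each Number.Mod([RowNumber], 2), Int64.Type),
    // Partition on the bucket column instead of RowNumber
    ReplacePartitionKey = Table.ReplacePartitionKey(AddBucket, {"Bucket"})
in
    ReplacePartitionKey

Varying the number of buckets like this would be one way of testing how much parallelism is actually useful.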

Measuring Power BI Report Page Load Times

If you’re performance tuning a Power BI report the most important thing you need to measure – and the thing your users certainly care about most – is how long it takes for a report page to load. Yet this isn’t something that’s available anywhere in Power BI Desktop or in the Service (though you can use browser dev tools to get it), and developers often concentrate on tuning just the individual DAX queries generated by the report instead. Usually that’s all you need to do, but running multiple DAX queries concurrently can affect the performance of each one, and there are other factors (for example geocoding in map visuals, or displaying images) that affect report performance, so if you don’t look at overall page render times you might miss them. In this post I’ll show you how you can measure report page load times, and the times taken for other kinds of report interaction, using Performance Analyzer in the Service and Power Query.

Consider the following series of interactions with a published Power BI report:

The report itself isn’t really that important – just know that there are a series of interactions with a slowish report while Performance Analyzer is running. Here’s what Performance Analyzer shows by the end of these interactions:

Here’s a list of the interactions captured:

  • I changed from a blank report page to a page with a table visual, where the table visual was cached and displayed immediately
  • I then refreshed the table visual on that page by clicking the Refresh Visuals button in the Performance Analyzer pane
  • I changed to the next page in the report and all the visuals on that page rendered
  • I changed the slicer on that new page
  • I clicked on the bar chart to cross-filter the rest of the page

As you can see from the screenshot above, Performance Analyzer tells you how long each visual takes to render within each interaction but it doesn’t tell you how long each interaction took in total. In a lot of cases you can assume that the time taken for an interaction is the same as the time taken for the slowest visual to render, but that may not always be true.

So how can you use Performance Analyzer to measure the time taken for these interactions? How can you measure the amount of time taken to render a page in a report?

To solve this problem I created a Power Query query that takes the event data JSON file that you can export from Performance Analyzer and returns a table showing the amount of time taken for each interaction. Here’s the M code for this query:

let
    Source = Json.Document(File.Contents("C:\PowerBIPerformanceData.json")),
    ToTable = Table.FromRecords({Source}),
    Events = ToTable{0}[events],
    EventTable = Table.FromList(Events, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(EventTable, "Column1", {"name", "start", "id", "metrics", "end"}, {"name", "start", "id", "metrics", "end"}),
    #"Expanded metrics" = Table.ExpandRecordColumn(#"Expanded Column1", "metrics", {"sourceLabel"}, {"sourceLabel"}),
    #"Added Custom1" = Table.AddColumn(#"Expanded metrics", "UserActionID", each if [name] = "User Action" then [id] else null),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "UserActionLabel", each if [name] = "User Action" then [sourceLabel] else null),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom2", {{"start", type datetime}, {"end", type datetime}, {"UserActionID", type text}, {"sourceLabel", type text}, {"UserActionLabel", type text}}),
    #"Filled Down" = Table.FillDown(#"Changed Type", {"UserActionID", "UserActionLabel"}),
    #"Filtered Rows" = Table.SelectRows(#"Filled Down", each [start] > #datetime(1970, 1, 2, 0, 0, 0)),
    #"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each [end] > #datetime(1970, 1, 2, 0, 0, 0)),
    #"Grouped Rows" = Table.Group(#"Filtered Rows1", {"UserActionID", "UserActionLabel"}, {{"Start", each List.Min([start]), type nullable datetime}, {"End", each List.Max([end]), type nullable datetime}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Duration", each [End] - [Start], type duration),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom", {"UserActionID"})
in
    #"Removed Columns"

Here’s the output of this query for the interactions shown above:

Some notes about this query:

  • You will need to change the Source step to point to the JSON file you have exported from Performance Analyzer
  • Each interaction is represented by a row in the table and identified by the UserActionLabel column
  • I’m calculating the durations by finding the minimum start time and the maximum end time for all events associated with an interaction and subtracting the former from the latter
  • There’s a bug (which hopefully gets fixed at some point) where some events have start and end dates in 1970, so I have filtered out any dates that are obviously wrong
  • The Duration column shows how long each interaction took and uses the Power Query duration data type, which is formatted as days.hours:minutes:seconds
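If you’d rather see the duration as a number of seconds than as a duration value, you could add one more step to the end of the query (and point the in clause at it) – a minimal sketch:

// Convert the Duration column from a duration to a number of seconds
#"Duration In Seconds" = Table.TransformColumns(
    #"Removed Columns",
    {{"Duration", Duration.TotalSeconds, type number}}
)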

The example above is fairly complex, showing several different kinds of interaction. If you just want to find the amount of time taken to render all the visuals on a page you can click the Refresh Visuals button in Performance Analyzer to refresh all the visuals on the page – it may not give you a 100% “cold cache” page render but it will be good enough. I’m not a web developer, but I think to really do things properly you’ll need to open the report on a blank page in the browser, do an “Empty Cache Hard Reload”, go to edit mode in the report, enable Performance Analyzer, then move to the page you want to test.

If you’re testing a DirectQuery model then you’ll also want to include the overhead of opening connections (which can be substantial); the only way I have found to do that is either to wait at least an hour for any connections in the pool to be dropped or, if you’re using a gateway, to restart it. One last point: while you can use Performance Analyzer in both Power BI Desktop and the browser, the behaviour of Power BI may be different in these two places, so always measure the performance of published reports in the browser because that’s where your users will be using your reports.

Here’s what clicking the Refresh Visuals button in Performance Analyzer to refresh all the visuals on a page looks like:

This results in a single interaction and a single row in the output of the Power Query query above:

In this case you can see that the page refresh took 12.14 seconds.

As you will have realised by now, getting the amount of time it takes to load a report page isn’t straightforward and there are a lot of factors to take into account. Nonetheless using Performance Analyzer in this way is much better than not measuring page load times at all or (as I’ve seen some people do) using a stopwatch. If you try this and find something interesting please let me know: I’m doing a lot of testing with Performance Analyzer and learning new things all the time.

Calculate(), DAX Fusion And Filters On 0 In Power BI

Here’s a fun DAX performance tip that I found this week. Do you have measures that use Calculate() with a filter on a numeric column? Is one of the filters on the value 0? If so then this may affect you.

As always, a simple example is the best way of explaining the problem and the solution. Consider the following table in an Import mode semantic model:

Here are some measures that reference that table:

'Sales Amount' = SUM('Sales'[SalesAmount])
'Apples Sales' = CALCULATE([Sales Amount], Sales[ProductID]=0)
'Oranges Sales' = CALCULATE([Sales Amount], Sales[ProductID]=1)
'Pears Sales' = CALCULATE([Sales Amount], Sales[ProductID]=2)
'Grapes Sales' = CALCULATE([Sales Amount], Sales[ProductID]=3)

Let’s say you then have a report which shows the value of the ‘Oranges Sales’ measure:

Running the DAX query generated by this table in DAX Studio with Server Timings enabled shows that there is just one Storage Engine query generated by this DAX query:

Here’s the xmSQL for that single SE query:

SET DC_KIND="AUTO";
SELECT
    SUM ( 'Sales'[SalesAmount] )
FROM 'Sales'
WHERE
    'Sales'[ProductID] = 1;

The WHERE clause here filters on the ProductID 1, which is the ID of the product Oranges.

Now if you add the measures ‘Pears Sales’ and ‘Grapes Sales’ to the table visual:

…and run the query in DAX Studio again, you’ll still see a single SE query:

Here’s the xmSQL:

SET DC_KIND="AUTO";
SELECT
    'Sales'[ProductID],
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] )
FROM 'Sales'
WHERE
    'Sales'[ProductID] IN ( 1, 2, 3 ) ;

As you can see, the WHERE clause now filters on the ProductIDs 1, 2 or 3: the IDs of the three products used in the three measures. This is DAX fusion – specifically horizontal fusion – in action: an optimisation where multiple filters on the same column can be combined into a single SE query. Fewer SE queries generally means better performance. So far so good.

Now let’s add the measure ‘Apples Sales’ to the table visual:

Running the DAX query in DAX Studio now shows there are two SE queries:

The first SE query has the same xmSQL as the previous DAX query:

SET DC_KIND="AUTO";
SELECT
    'Sales'[ProductID],
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] )
FROM 'Sales'
WHERE
    'Sales'[ProductID] IN ( 1, 2, 3 ) ;

The second SE query has the following xmSQL and a WHERE clause that indicates it is retrieving the data for just Apples:

SET DC_KIND="AUTO";
SELECT
    SUM ( 'Sales'[SalesAmount] )
FROM 'Sales'
WHERE
    'Sales'[ProductID] IN ( null, 0 ) ;

Two SE queries can mean worse overall performance. Why is the filter on 0 in the ‘Apples Sales’ measure special? Why does it result in a second SE query, why does this second SE query filter on 0 or null, and why doesn’t horizontal fusion take place for Apples?

The answer lies with how DAX handles blanks and zeroes, something discussed in depth in this article by Marco Russo. The filter condition in the ‘Apples Sales’ measure:

'Apples Sales' = CALCULATE([Sales Amount], Sales[ProductID]=0)

…actually filters on 0 or blank, and that’s why the xmSQL of the second SE query filters on 0 or null. That in turn explains why horizontal fusion does not take place: all the other measures filter on a specific number, while the ‘Apples Sales’ measure filters on the number 0 or blank.
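You can see this behaviour of the = operator for yourself with a trivial DAX query, for example by running the following in DAX Studio:

EVALUATE
ROW (
    // returns TRUE: with =, a blank is treated as equal to 0
    "BlankEqualsZero", BLANK () = 0,
    // returns FALSE: == does not treat a blank as equal to 0
    "BlankStrictlyEqualsZero", BLANK () == 0
)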

The solution is to update the measures in the model to use the strictly equal to == operator like so:

'Sales Amount' = SUM('Sales'[SalesAmount])
'Apples Sales' = CALCULATE([Sales Amount], Sales[ProductID]==0)
'Oranges Sales' = CALCULATE([Sales Amount], Sales[ProductID]==1)
'Pears Sales' = CALCULATE([Sales Amount], Sales[ProductID]==2)
'Grapes Sales' = CALCULATE([Sales Amount], Sales[ProductID]==3)

After this change the DAX query that returns the ‘Apples Sales’, ‘Oranges Sales’, ‘Pears Sales’ and ‘Grapes Sales’ measures now generates a single SE query, meaning that horizontal fusion is taking place for all measures:

Here’s the xmSQL for that query:

SET DC_KIND="AUTO";
SELECT
    'Sales'[ProductID],
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] ),
    SUM ( 'Sales'[SalesAmount] )
FROM 'Sales'
WHERE
    'Sales'[ProductID] IN ( 0, 1, 2, 3 ) ;

As you can see, the WHERE clause now filters on the Product IDs 0, 1, 2 or 3.

This example uses an Import mode model but the tip also applies to DirectQuery mode. In fact, because additional SE queries (which mean additional SQL queries) have more of an impact on performance in DirectQuery mode, ensuring horizontal fusion takes place can be even more important there.

I think this tip could benefit a lot of semantic models out there. A lot of measures use Calculate() and filter on numeric columns: the customer I was working with this week had measures that filtered on a year offset column on their date dimension table (so filtering on offset 0 meant this year, filtering on offset 1 meant last year and so on), and I reproduced the problem on my sample DirectQuery semantic model based on the ContosoDW sample database with a filter on the NumberChildrenAtHome column of the Customer dimension table. Also, I can’t remember the last time I saw the strictly-equal-to operator in the wild. If you find this tip helps you, please let me know by leaving a comment!

Measuring Time To Display For Image Visuals In Power BI With Performance Analyzer

Carrying on my series on troubleshooting Power BI performance problems with Performance Analyzer, another situation where a report may be slow even when the DAX queries it generates against the underlying semantic model are fast is when you have large images displayed in an Image visual. Let’s see an example.

I created a Power BI report consisting of a single image visual that displayed a 25MB photo:

I then published the report – and this is important because the behaviour I’m describing here is only relevant to published reports – then cleared the browser cache, started Performance Analyzer and viewed the page with the image visual on it. I then stopped Performance Analyzer, exported the results and visualised them with my custom visual. Here’s what I saw:

The Image visual took just over 6 seconds to load, and almost all of that time is related to the Visual Container Resource Load event. You can guess from the name of the event that this is the time taken to load the image; looking at what happens behind the scenes with Chrome DevTools when the report renders confirms it.

It’s important to understand that the Image visual won’t take this long to load every time the report is rendered – caching means that the image will only need to be downloaded once. Still, it’s another reminder that having a large image or a lot of small images on a page can have an impact on report performance.

Measuring Geocoding Performance In Power BI Map Visuals Using Performance Analyzer

When a user complains about their Power BI report being slow it’s normal for the developer to open up DAX Studio and start tuning the measures in the semantic model because 95% of the time the problem is the DAX. But there’s a whole class of report performance problems that are nothing to do with the semantic model or DAX and they are the ones that are much easier to troubleshoot now that Performance Analyzer is available in the browser as well as Desktop. Today I want to look at one of them: geocoding in map visuals.

What is geocoding? Let me show you an example. I have a semantic model with data about UK real estate sales where each row in the fact table represents the sale of some kind of property like a house or apartment. The model has the address of the property sold and the address includes a postcode (similar to a US zip code). I added an Azure Map visual onto a report and dragged my Postcode data into the Location well of the visual and got this:

Postcodes are just text values. How does Power BI know where each postcode is on a map? It does so by geocoding: sending each postcode to a web service which returns a latitude and longitude for that postcode which allows its location to be plotted. And if, as in this case, you have thousands of postcodes to geocode then this process can be slow.

Because geocoding can be slow Power BI will cache geocoded locations but you can turn this caching off on the Diagnostics tab of the Options dialog using the “Bypass geocoding cache” option:

I did this to get worst-case performance, refreshed my Azure Map visual with Performance Analyzer running, then exported the output of Performance Analyzer and visualised it with my Performance Analyzer custom visual. Here’s what it showed:

What does this show us?

  • The visual as a whole took 107 seconds to display, as shown by the duration of the Visual Container Lifecycle event
  • The underlying DAX query (which returns 30002 rows – not all the postcodes but the limit of what can be displayed) took a fraction of a second to return
  • The visual took 6 seconds to render
  • The geocoding took 106 seconds

So, this is a great example of a report that is slow where the model and its measures are not the cause, and where Performance Analyzer is the only way to see what the actual cause is.

What can you do to improve performance? Well, as I said, with caching enabled (as it is by default) performance would be nowhere near this bad, but there are some things you can do. First of all, it’s obvious that the visual displays more data than any user can make sense of – and Power BI is sampling the data in this case too – so redesigning the report so the user only sees a useable, comprehensible subset of the data on the map would be a good start. You could also geocode the data yourself during your ETL rather than waiting for Power BI to do it when it displays the report – postcodes are not going to change location – and this would give you latitudes and longitudes you can give the Azure Map visual instead, as shown in the sketch below. Finally, it’s probably worth experimenting with different visuals, including other map custom visuals, to see if they perform differently for your requirements.
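To illustrate the ETL approach, here’s a minimal sketch of an M function that geocodes a single postcode. The web service used here is entirely hypothetical – you’d substitute the real URL, query parameters and authentication of whatever geocoding API you have access to:

// Hypothetical: calls a fictional REST endpoint that returns JSON
// like {"lat": 51.5, "lon": -0.1} for a given postcode
(postcode as text) as record =>
let
    Response = Json.Document(
        Web.Contents(
            "https://example.com/geocode",
            [Query = [postcode = postcode]]
        )
    )
in
    [Latitude = Response[lat], Longitude = Response[lon]]

You could call a function like this once per distinct postcode, store the results, and then give the resulting latitude and longitude columns to the Azure Map visual instead of the raw postcodes.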

Diagnosing Power BI DirectQuery Connection Limit Problems With Performance Analyzer

To kick off my series on diagnosing Power BI performance problems with Performance Analyzer in the browser (which I introduced last week with my post on vibe-coding a custom visual to visualise Performance Analyzer data), I want to revisit a subject I blogged about two years ago: how hitting the limit on the maximum number of connections to a DirectQuery data source can lead to queries queuing for an available connection, and to performance problems. In my original post on this topic I showed how you can use the Execution Metrics event in Profiler/Log Analytics/Workspace Monitoring to see when this queuing happens. In this post I will show how you can do exactly the same thing with Performance Analyzer.

Here’s the semantic model I used in my previous post: it has three tables in DirectQuery mode connected to SQL Server. Each table consists of a single row and column and is bound to a SQL query that takes 10 seconds to run (using the TSQL custom function I blogged about here).

Here’s the report connected to this model, containing three cards, each of which displays the single value returned by one of these three tables. As you would expect, the DAX query associated with each of these card visuals takes 10 seconds to run when run in isolation.

With the Max Connections Per Data Source property set to the default value of 10:

…I ran the report in the browser with Performance Analyzer running. Here’s what I saw in the Performance Analyzer pane:

No surprises: the DirectQuery timings are all around 10 seconds. I exported the Performance Analyzer data and loaded it into my custom visual. The events for the three card visuals were all very similar:

I then set the Max Connections Per Data Source property on the semantic model to 1, so there was only one connection available back to SQL Server, and reran the report with Performance Analyzer running. Here’s what Performance Analyzer showed in the browser this time:

The fact that the DirectQuery activity for Table C took 13 seconds, the DirectQuery activity for Table B took 24 seconds and the DirectQuery activity for Table A took 35 seconds suggests that there’s some queuing happening but there’s nothing here that tells you that for sure. But exporting the data from Performance Analyzer and loading it into my visual showed the following for Table C:

Table B:

And Table A:

Note how for Table C the bar for the Get Source Connection event is very small, but for Table B it’s around 12 seconds and for Table A it’s around 24 seconds. This tells you exactly what the problem was: queuing for a connection.

As I said, you can get the same information from the Execution Metrics event but installing Profiler or capturing this data with Log Analytics or Workspace Monitoring isn’t always an option; this is a lot more convenient.

A Look At The Impact Of Calendar Based Time Intelligence On Power BI DirectQuery Performance

Calendar-based time intelligence (see here for the announcement and here for Marco and Alberto’s more in-depth article) is at least the second-most exciting thing to happen in DAX in the last few months: it makes many types of time intelligence calculation much easier to implement. But as far as I know only Reid Havens, in this video, has mentioned the performance impact of using this new feature, and that was for Import mode. So I wondered: do these benefits also apply to DirectQuery mode? The answer is, on balance, yes – but it’s not clear-cut.

To illustrate what I mean, I built a simple DirectQuery model against the Adventure Works DW sample database in SQL Server:

This model used the old “Mark as date table” time intelligence.

Here are the definitions of some of the measures:

Sales Amount =
SUM ( 'Internet Sales'[SalesAmount] )

YTD Sales Amount =
CALCULATE ( [Sales Amount], DATESYTD ( 'Date'[FullDateAlternateKey] ) )

PY YTD Sales Amount =
CALCULATE (
    [YTD Sales Amount],
    SAMEPERIODLASTYEAR ( 'Date'[FullDateAlternateKey] )
)

The Sales Amount measure returns the sum of the values in the SalesAmount column; the YTD Sales Amount finds the year-to-date sum of Sales Amount; and PY YTD Sales Amount finds the value of this measure in the same period of the previous year.

I then created a matrix visual showing the PY YTD Sales Amount measure with EnglishProductName from the Product dimension on columns and CalendarYear and EnglishMonthName from the Date dimension on rows:

I copied the DAX query for this visual from Performance Analyzer, pasted it into DAX Studio and then ran it on a cold cache with Server Timings enabled. Here’s what Server Timings showed:

A total duration of 1.9 seconds and 5 SE queries doesn’t look too bad. But here are the Execution Metrics for this query with some important metrics highlighted:

{
    "timeStart": "2025-11-29T17:17:16.461Z",
    "timeEnd": "2025-11-29T17:17:18.350Z",
    "durationMs": 1890,
    "datasourceConnectionThrottleTimeMs": 0,
    "directQueryConnectionTimeMs": 24,
    "directQueryIterationTimeMs": 166,
    "directQueryTotalTimeMs": 1681,
    "externalQueryExecutionTimeMs": 1493,
    "queryProcessingCpuTimeMs": 16,
    "totalCpuTimeMs": 828,
    "executionDelayMs": 3,
    "approximatePeakMemConsumptionKB": 20977,
    "directQueryTimeoutMs": 3599000,
    "tabularConnectionTimeoutMs": 3600000,
    "commandType": "Statement",
    "queryDialect": 3,
    "queryResultRows": 1613,
    "directQueryRequestCount": 5,
    "directQueryTotalRows": 33756
}

The important thing to notice is that while the DAX query returns 1613 rows (see the queryResultRows metric) the SQL queries generated for that DAX query return 33756 rows between them (see the directQueryTotalRows metric). Why the big difference? This is because to do the year-to-date calculation using the old time intelligence functionality, Power BI has to run a query at the date granularity, which explains why there are so many more rows returned by the SQL queries. For example here’s a snippet of the last SQL query generated:

Yuck. What’s more, bringing this number of rows from the source can be time-consuming and even after these rows have made it to Power BI, they need to be iterated over (see the directQueryIterationTimeMs metric of 166ms) and aggregated up to get the final result of the calculation. This requires memory (see the approximatePeakMemConsumptionKB metric of 20977KB) and CPU (see the totalCpuTimeMs metric of 828ms) as well as adding to the overall duration of the DAX query.

I then created a copy of this model and set up a calendar using the new calendar-based time intelligence feature like so:

I then modified the measures above to use this new calendar:

Sales Amount =
SUM ( 'Internet Sales'[SalesAmount] )

YTD Sales Amount =
CALCULATE ( [Sales Amount], DATESYTD ( 'Gregorian' ) )

PY YTD Sales Amount =
CALCULATE ( [YTD Sales Amount], SAMEPERIODLASTYEAR ( 'Gregorian' ) )

I then reran the same DAX query from my matrix visual in DAX Studio for this model. Here are the Server Timings:

The good news is that the query is now much faster: 0.5 seconds instead of 1.9 seconds. But there are more SE queries! I’m told this is because some fusion optimisations (this presentation by Phil Seamark is an excellent introduction to the subject) haven’t yet been implemented for the new calendar-based time intelligence functionality, which means more SQL queries are generated than you might expect. Indeed some of the SQL queries run are identical. And since there is a limit on the number of connections that Power BI can use to run SQL queries in DirectQuery mode, and since you can run into performance problems when you hit those limits (see here for more details), more SQL queries can be a bad thing – especially when there are many visuals on a page or a lot of concurrent users using the same semantic model.

However there is more good news if you look closely. Here are the Execution Metrics for this second run:

{
    "timeStart": "2025-11-29T17:34:56.223Z",
    "timeEnd": "2025-11-29T17:34:56.754Z",
    "durationMs": 531,
    "datasourceConnectionThrottleTimeMs": 0,
    "directQueryConnectionTimeMs": 41,
    "directQueryIterationTimeMs": 38,
    "directQueryTotalTimeMs": 465,
    "externalQueryExecutionTimeMs": 410,
    "queryProcessingCpuTimeMs": 16,
    "totalCpuTimeMs": 141,
    "executionDelayMs": 0,
    "approximatePeakMemConsumptionKB": 3812,
    "directQueryTimeoutMs": 3600000,
    "tabularConnectionTimeoutMs": 3600000,
    "commandType": "Statement",
    "queryDialect": 3,
    "queryResultRows": 1613,
    "directQueryRequestCount": 11,
    "directQueryTotalRows": 3369
}

Even though there are more SQL queries now the total number of rows returned by them is much less: the directQueryTotalRows metric is only 3369, so about 10% of what it was before. Why? Because instead of having to go down to the date granularity to do the calculations, the new calendar-based time intelligence functionality allows Power BI to do the calculation at the month granularity. Here’s a snippet of one of the SQL queries generated that shows this:

This in turn means that directQueryIterationTimeMs (now only 38ms), totalCpuTimeMs (now only 141ms) and approximatePeakMemConsumptionKB (now only 3812KB) are all much less than before. Also, this could mean you’re less likely to run into the Max Intermediate Row Set Count limit on the maximum number of rows that a DirectQuery SQL query can return and it opens up more opportunities to use aggregations to improve performance.

As a result, if you’re running into query performance, CU or memory-related problems in DirectQuery mode, you should experiment with using the new calendar-based time intelligence feature to see if it can help even if it results in more SQL queries being generated. Hopefully when those fusion optimisations are implemented in the future the benefits will be even greater.

Finally, it’s also worth mentioning that using Visual Calculations or Window functions (as discussed here) has very similar benefits when tuning DirectQuery mode, so you should check them out too and consider using them in combination with calendar-based time intelligence.
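To give one example of what that might look like, a YTD-style calculation can be expressed as a visual calculation using a running sum. This is only a sketch, assuming a matrix with year at the top level and month below it on the rows axis, and relying on the documented reset behaviour of RUNNINGSUM:

// Visual calculation: running sum of Sales Amount down the rows axis,
// resetting at the highest level of the axis (the year)
YTD Sales VC = RUNNINGSUM ( [Sales Amount], HIGHESTPARENT )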

Linking Queries Run From DAX Studio To Workspace Monitoring And The Capacity Metrics App

A few weeks ago I wrote a blog post about how you can now link event data from Workspace Monitoring to data in the Fabric Capacity Metrics App using OperationId values. In the latest (3.4.0) release of DAX Studio there’s a new feature that you might have missed that links queries run from there to Workspace Monitoring and the Capacity Metrics App in the same way.

I connected DAX Studio to a published Power BI semantic model, turned on Server Timings and ran a DAX query. When the query finished I clicked the Info button at the top of the Server Timings pane, which opened the Query Information dialog shown below, and then I copied the Request ID value from there:

The Request ID is the same thing as the OperationId value in Workspace Monitoring and the Capacity Metrics app. I was then able to use this value in a KQL query in Workspace Monitoring like so:

SemanticModelLogs
| where Timestamp > ago(1h)
| where toupper(OperationId) == "5D4A4F47-370D-4634-B67B-E4B58CB067A8"
| project Timestamp, OperationName, OperationDetailName, EventText, Status, StatusCode
| order by Timestamp asc

[Note that you need to convert the OperationId value in Workspace Monitoring to uppercase to match it to the Request ID from DAX Studio]

What’s the value of doing this? Probably not much for an individual query because the information you see in Workspace Monitoring is exactly the same as you get in Server Timings in DAX Studio, and DAX Studio shows this information in a way that’s much easier to understand. If you’re testing a number of DAX queries, though, then having all this data available in Workspace Monitoring means you can do detailed comparisons of different runs using KQL.
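For example, a KQL query along the following lines could compare the total duration and CPU time of several test runs side by side. This is a sketch: the Request IDs are placeholders, and I’m assuming the DurationMs and CpuTimeMs columns found elsewhere in SemanticModelLogs:

SemanticModelLogs
| where Timestamp > ago(1d)
// placeholder Request IDs copied from DAX Studio, converted to uppercase
| where toupper(OperationId) in ("REQUEST-ID-1", "REQUEST-ID-2")
| where OperationName == "QueryEnd"
| summarize TotalDurationMs = sum(DurationMs), TotalCpuTimeMs = sum(CpuTimeMs) by OperationId
| order by TotalDurationMs desc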

What is really useful is being able to find the CU usage of a DAX query run from DAX Studio on a capacity – something that is very important when tuning DAX queries. While DAX Studio (and Workspace Monitoring) will give you a CPU Time value for a DAX query, for reasons I won’t go into here you won’t be able to reverse engineer the algorithm that converts CPU Time to CUs. However, by linking the Request ID from DAX Studio to the OperationIds seen on the Timepoint Detail and the new Timepoint Item Detail pages in newer versions of the Capacity Metrics app (as discussed here), you don’t need to care about that. You can simply take the Request ID from DAX Studio and find it in the Capacity Metrics app. For example, on the Timepoint Item Detail (preview) page you can either find the value in the OperationId slicer or add the OperationId column to the lower table using the Select Optional Column(s) slicer and look for it there:

Performance Testing Power BI Direct Lake Models Revisited: Ensuring Worst-Case Performance

Two years ago I wrote a detailed post on how to do performance testing for Direct Lake semantic models. In that post I talked about how important it is to run worst-case scenario tests to see how your model performs when there is no model data present in memory, and how it was possible to clear all the data held in memory by doing a full refresh of the semantic model. Recently, however, a long-awaited performance improvement for Direct Lake has been released which means a full semantic model refresh may no longer page all data out of memory – which is great, but which also makes running performance tests a bit more complicated.

First of all, what is this new improvement? It’s called Incremental Framing and you can read about it in the docs here. Basically, instead of clearing all data out of memory when you do a full refresh of a Direct Lake model, the model now checks each Delta table it uses to see whether the data in it has actually changed. If it hasn’t, there’s no need to clear any data from that table out of memory. Since there’s a performance overhead to loading data into memory when a query runs, this means you’re less likely to encounter that overhead, and queries (especially for models where the data in some tables changes frequently) will be faster overall. I strongly recommend you read the entire docs page carefully though, not only because it contains a lot of other useful information, but also because you might be loading data into your lakehouses in a way that prevents this optimisation from working.

Let me show you an example of this by revisiting a demo from a session I’ve done at several user groups and conferences on Power BI model memory usage (there are several recordings of it available, such as this one). Using a Direct Lake semantic model consisting of a single large table with 20 columns containing random numbers, I set the Direct Lake Behaviour setting in DAX Studio’s Options dialog to ResidentOnly, to stop DAX Studio’s Model Metrics feature from loading data from all columns into memory when it runs:

When there is no data held in memory, running Model Metrics shows that the size of each column in the semantic model is negligible and that the Temperature and Last Accessed values for all model columns are blank:

Then, if I run a query that asks for data from just one column (in this case the column called “1”) from this table, like this:

EVALUATE ROW("Test", DISTINCTCOUNT('SourceData'[1]))

…and then rerun Model Metrics, the size in memory for that column changes, because of course it has been loaded into memory in order to run the query:

Zooming in on the Model Metrics table columns from the previous screenshot that show the size in memory:

And here are the Temperature and Last Accessed columns from the same screenshot which are no longer blank:

Since the query had to bring the column into memory before it could run, the DAX query took around 5.3 seconds. Running the same query after that, even after using the Clear Cache button in DAX Studio, took only about 0.8 seconds because the data needed for the query was already resident in memory.

OK, so far nothing has changed in terms of behaviour. However if you do a full refresh from the Power BI UI without making any changes to the underlying Delta tables:

…and then rerun Model Metrics: nothing changes and the data is still in memory! As a result the DAX query above still only takes about 0.8 seconds.
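Incidentally, if you want to check column residency directly rather than through Model Metrics, you can query the same DMV that Model Metrics uses. A sketch – I believe the ISRESIDENT, TEMPERATURE and LAST_ACCESSED columns are populated for Direct Lake models, but treat the exact column list as an assumption:

SELECT TABLE_ID, COLUMN_ID, SEGMENT_NUMBER, ISRESIDENT, TEMPERATURE, LAST_ACCESSED
FROM $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS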

So how do you get that worst-case performance again? As mentioned in the docs here, you now need to do a refresh of type clearValues followed by a full refresh. You can’t do a refresh of type clearValues from the Power BI UI though, so the easiest way to do it is to use a Fabric notebook and Semantic Link Labs. Here’s how. First, install Semantic Link Labs:

%pip install semantic-link-labs

Then use the following code in a notebook cell to do a refresh of type clearValues followed by a full refresh:

import sempy_labs as labs
WorkspaceName = "Insert Workspace Name Here"
SemanticModelName = "Insert Semantic Model Name Here"
# run a refresh of type clearValues first
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="clearValues")
# then a refresh of type full
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full")

After doing this on my model, Model Metrics shows that the column called “1” that was previously in memory is no longer resident:

…and the query above once again takes 5 seconds to run.

So, as you can see, if you’re doing performance testing of a Direct Lake model you now need to make sure you do a refresh of type clearValues and a full refresh of your model before each test to ensure no data is resident in memory and get worst-case performance readings, in addition to testing performance on a cold cache and a warm cache.