Query Folding And Staging In Fabric Dataflows Gen2

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However one thing I never got round to doing until this week is looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder to do than you might think.

Consider the following simple Dataflow consisting of two queries:

The first query, called StageData, reads data from a 6GB, 17 million row CSV file containing open data from the English Prescribing Data dataset. It returns two columns called PRACTICE_NAME and TOTAL_QUANTITY from that CSV file:

Staging is enabled on this query:

The second query, called GroupBy, takes the data returned by the StageData query and does a Group By operation to get the sum of the values in the TOTAL_QUANTITY column for each distinct value in PRACTICE_NAME:

The output of this query was loaded to a Fabric Warehouse:

The scenario is basically the same as the one from my previous post but with a much larger data volume, and the idea was to test again whether it was faster to stage the data and do the Group By on the staged data or to not stage the data and do the Group By while reading the data direct from the source.

It turned out that, once again, staging made performance worse (don’t worry, I have other tests that show it can help performance). But the point about staging is that by loading the data from a query into a hidden Fabric Lakehouse, managed by the Dataflow (which is what is meant by “staging”), any subsequent operations on this data are faster because query folding can take place against the SQL Endpoint of this hidden Lakehouse – and at the time of writing this post there’s no way of knowing from the Dataflows Editor whether query folding is taking place. Right-clicking on the step that does the Group By operation shows that the “View data source query” option is greyed out but this only tells you that you the Editor doesn’t know if folding is taking place:

In fact other things in the UI, such as the query plan and the query folding indicators, suggest incorrectly that folding is not taking place:

So I thought: if query folding is taking place then the Group By will result in a SQL query run against the hidden Lakehouse, so maybe I can see this SQL query somewhere? Unfortunately since the Lakehouse is hidden you can’t get to it through the Fabric web interface. But then I remembered that you can connect to a Fabric workspace using good old SQL Server Management Studio (instructions on how to can be found here). And when I connected using SSMS I could see two hidden objects created by by Dataflow called StagingLakehouseForDataflows and StagingWarehouseForDataflows:

I was then able to run a SQL query using the queryinsights.exec_requests_history DMV against StagingWarehouseForDataflows, filtered for the time range when my Dataflow refresh was taking place:

SELECT start_time,statement_type, command, total_elapsed_time_ms
FROM
queryinsights.exec_requests_history
WHERE
start_time>'insert DF refresh start time here'
AND end_time<'insert DF refresh end time here'
ORDER BY start_time desc

…and saw the following INSERT INTO statement that did the Group By operation I was expecting to see along with how long it took:

insert into
[StagingWarehouseForDataflows_20260223144925].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_2930fd3c_002D2a62_002D4518_002Dafbf_002D249e7af54403]
([Column1], [Column2])
select [_].[Column1] as [Column1],
convert(float, [_].[Total Quantity]) as [Column2]
from (
select [rows].[Column1] as [Column1],
sum([rows].[Column2]) as [Total Quantity]
from [StagingLakehouseForDataflows_20260223144911].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_179186be_002D367d_002D4924_002Da8ba_002Dd1f220415e3a]
as [rows]
group by [Column1] )
as [_]

So, a useful tip if you’re performance tuning a Dataflow even if it’s a bit of a pain to do. Hopefully in the future we’ll be able to see the SQL generated when query folding takes place against a staged table.

[Thanks to Miguel Escobar for his help with this]

New Semi Join, Anti Join And Query Folding Functionality In Power Query

There are a couple of nice new features to do with table joins (or merges as they are known in M) and query folding in Power Query in the April release of Power BI Desktop that I want to highlight.

Anti Joins now fold

First of all, a few months ago I wrote a post about how the built-in anti join functionality didn’t fold in Power Query. The good news is that it now does on SQL Server-related sources, so no more workarounds are needed. For example, if you have two tables in a SQL Server database called Fruit1 and Fruit2 and two Power Query queries that get data from those tables:

…then the following M code:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.LeftAnti
  )
in
  Source

…returns the following table of fruits that are in the Fruit1 table and not in the Fruit2 table:

Of course that’s what the code above returned in previous versions of Power Query too. The difference now is that query folding occurs and the following SQL code is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where not exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null

New join kind: semi joins

There are also two brand new join kind you can use in the Table.Join and Table.NestedJoin functions: JoinKind.LeftSemi and JoinKind.RightSemi. Semi joins allow you to select the rows in one table that have matching values in another table. Using the Fruit1 and Fruit2 tables above, the following M code:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  )
in
  Source

Returns all the rows in Fruit1 where there is a matching value in Fruit2:

Here’s the SQL that is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null

The ?? operator now folds

The M language’s ?? coalesce operator is used for replacing null values and this now folds on SQL Server-related sources too now. For example, the M query in the previous section that did a semi join on Fruit1 and Fruit2 returns a table where all the rows in the Fruit2 colum contain null values. The following M query adds a new custom column that returns the text value “Nothing” when the Fruit2 column contains a null:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  ), 
  ReplaceNulls = Table.AddColumn(
    Source, 
    "NullReplacement", 
    each [Fruit2] ?? "Nothing"
  )
in
  ReplaceNulls

Here’s the SQL generated for this, where the ?? operator is folded to a CASE statement:

select [_].[Fruit1] as [Fruit1],
    [_].[Fruit2] as [Fruit2],
    case
        when [_].[Fruit2] is null
        then 'Nothing'
        else [_].[Fruit2]
    end as [NullReplacement]
from 
(
    select [$Outer].[Fruit1],
        cast(null as nvarchar(50)) as [Fruit2]
    from 
    (
        select [_].[Fruit] as [Fruit1]
        from [dbo].[Fruit1] as [_]
    ) as [$Outer]
    where exists 
    (
        select 1
        from 
        (
            select [_].[Fruit] as [Fruit2]
            from [dbo].[Fruit2] as [_]
        ) as [$Inner]
        where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
    )
) as [_]

[Thanks to Curt Hagenlocher for the information in this post]