Query Folding And Staging In Fabric Dataflows Gen2

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However, one thing I never got round to until this week was looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder than you might think.

Consider the following simple Dataflow consisting of two queries:

The first query, called StageData, reads data from a 6GB, 17 million row CSV file containing open data from the English Prescribing Data dataset. It returns two columns called PRACTICE_NAME and TOTAL_QUANTITY from that CSV file:
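A minimal M sketch of a query like StageData is shown below; the file path, connector and type conversions are assumptions for illustration (in a real Dataflow the file would more likely come from a Lakehouse or other cloud source):

```
let
    Source = Csv.Document(
        File.Contents("C:\Prescribing\EPD.csv"),  // hypothetical path
        [Delimiter = ",", Encoding = 65001]
    ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    SelectedColumns = Table.SelectColumns(
        PromotedHeaders,
        {"PRACTICE_NAME", "TOTAL_QUANTITY"}
    ),
    ChangedTypes = Table.TransformColumnTypes(
        SelectedColumns,
        {{"PRACTICE_NAME", type text}, {"TOTAL_QUANTITY", type number}}
    )
in
    ChangedTypes
```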

Staging is enabled on this query:

The second query, called GroupBy, takes the data returned by the StageData query and does a Group By operation to get the sum of the values in the TOTAL_QUANTITY column for each distinct value in PRACTICE_NAME:
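A hedged sketch of what the GroupBy query might look like – the aggregation as described above, assuming TOTAL_QUANTITY is numeric:

```
let
    Source = StageData,
    GroupedRows = Table.Group(
        Source,
        {"PRACTICE_NAME"},
        {{"TOTAL_QUANTITY", each List.Sum([TOTAL_QUANTITY]), type nullable number}}
    )
in
    GroupedRows
```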

The output of this query was loaded to a Fabric Warehouse:

The scenario is basically the same as the one from my previous post but with a much larger data volume, and the idea was to test again whether it was faster to stage the data and do the Group By on the staged data or to not stage the data and do the Group By while reading the data direct from the source.

It turned out that, once again, staging made performance worse (don’t worry, I have other tests that show it can help performance). But the point of staging is that by loading the data from a query into a hidden Fabric Lakehouse managed by the Dataflow (which is what is meant by “staging”), any subsequent operations on this data can be faster because query folding can take place against the SQL Endpoint of this hidden Lakehouse – and at the time of writing this post there’s no way of knowing from the Dataflows Editor whether query folding is taking place. Right-clicking on the step that does the Group By operation shows that the “View data source query” option is greyed out, but this only tells you that the Editor doesn’t know if folding is taking place:

In fact other things in the UI, such as the query plan and the query folding indicators, suggest incorrectly that folding is not taking place:

So I thought: if query folding is taking place then the Group By will result in a SQL query run against the hidden Lakehouse, so maybe I can see this SQL query somewhere? Unfortunately, since the Lakehouse is hidden you can’t get to it through the Fabric web interface. But then I remembered that you can connect to a Fabric workspace using good old SQL Server Management Studio (instructions on how to do this can be found here). And when I connected using SSMS I could see two hidden objects created by my Dataflow called StagingLakehouseForDataflows and StagingWarehouseForDataflows:

I was then able to run a SQL query using the queryinsights.exec_requests_history DMV against StagingWarehouseForDataflows, filtered for the time range when my Dataflow refresh was taking place:

SELECT start_time, statement_type, command, total_elapsed_time_ms
FROM queryinsights.exec_requests_history
WHERE start_time > 'insert DF refresh start time here'
  AND end_time < 'insert DF refresh end time here'
ORDER BY start_time DESC

…and saw the following INSERT INTO statement that did the Group By operation I was expecting to see along with how long it took:

insert into
[StagingWarehouseForDataflows_20260223144925].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_2930fd3c_002D2a62_002D4518_002Dafbf_002D249e7af54403]
([Column1], [Column2])
select [_].[Column1] as [Column1],
convert(float, [_].[Total Quantity]) as [Column2]
from (
select [rows].[Column1] as [Column1],
sum([rows].[Column2]) as [Total Quantity]
from [StagingLakehouseForDataflows_20260223144911].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_179186be_002D367d_002D4924_002Da8ba_002Dd1f220415e3a]
as [rows]
group by [Column1] )
as [_]

So, a useful tip if you’re performance tuning a Dataflow even if it’s a bit of a pain to do. Hopefully in the future we’ll be able to see the SQL generated when query folding takes place against a staged table.

[Thanks to Miguel Escobar for his help with this]

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

For my tests I created two queries within a single Dataflow Gen2 (CI/CD). First, an M function called SlowFunction that takes a numeric value and returns that value with 1 added to it after a two second delay:

(input as number) as number =>
Function.InvokeAfter(()=>input+1, #duration(0,0,0,2))

Then the main query which returns a table of ten rows and calls the SlowFunction M function once per row:

let
    Rows = List.Transform({1 .. 10}, each {_}),
    MyTable = #table(type table [RowNumber = number], Rows),
    #"Added custom" = Table.TransformColumnTypes(
        Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
        {{"FunctionOutput", Int64.Type}}
    )
in
    #"Added custom"

Here’s what the output of the query looks like:

Now, the first important question. How long does this Dataflow take to run? 10 rows calling a function that takes 2 seconds to run, so 10*2=20 seconds maybe? The answer is yes if you look at how long the preview takes to populate in the Dataflow Editor:

That’s just the preview though. If you’re refreshing a Dataflow there are other things that happen that affect how long that refresh takes, such as Staging and loading the data to a destination. There’s no way you can split the performance of your M code from these operations when looking at the duration of a Dataflow refresh in Recent Runs, which explains why some of the timings you will see later in this post seem strange. Don’t worry, though, it doesn’t stop you from seeing the important trends. I’m told that setting a CSV file in a Lakehouse as your data destination is the best way of minimising the impact of loading data on overall refresh durations but at the time of writing the CSV destination can’t be used with Partitioned Compute so all my tests used a Fabric Warehouse as a destination.

Here’s what Recent Runs showed when this Dataflow was refreshed:

The overall refresh time was 59 seconds; the query (called NonPartitioned here) that returns the table of ten rows and which was staged took 29 seconds.

Could this Dataflow benefit from Partitioned Compute? With Partitioned Compute enabled in the Options dialog, I added the necessary M code to the query:

let
    Rows = List.Transform({1 .. 10}, each {_}),
    MyTable = #table(type table [RowNumber = number], Rows),
    #"Added custom" = Table.TransformColumnTypes(
        Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
        {{"FunctionOutput", Int64.Type}}
    ),
    ReplacePartitionKey = Table.ReplacePartitionKey(#"Added custom", {"RowNumber"})
in
    ReplacePartitionKey

…and then refreshed. Here’s what Recent Runs showed:

The overall refresh duration went up to 1 minute 25 seconds; the query that does all the work (called Partitioned in this case) took 40 seconds. Note the screenshot immediately above shows the engine used is “PartitionedCompute” and that there are now ten activities listed instead of one: my M code used the RowNumber column in the table as the partition key so the Dataflow attempted to run each row of the table as a separate operation in parallel. And as you can see, this made performance worse. This is because using Partitioned Compute introduces yet another overhead and that overhead is much greater than any benefit gained from parallelism in this case.

So I wondered: what if the delay in the query is increased from 2 seconds to 100 seconds? Would this larger delay mean that parallelism results in faster overall performance?

Here’s what Recent Runs showed for a version of my Dataflow with a 100 second delay for each row and which didn’t use Partitioned Compute:

10 rows * 100 seconds = 1000 seconds = 16 minutes 40 seconds, so it’s not surprising that the overall duration of this version of the Dataflow was slow at 17 minutes 29 seconds.

Here’s what Recent Runs shows for the version of this Dataflow that did use Partitioned Compute:

The overall duration was 4 minutes 41 seconds and the main query took 3 minutes 14 seconds. The important takeaway is that this is a lot faster than the version that didn’t use Partitioned Compute, so clearly Partitioned Compute made a big difference to performance here. As you might expect, it looks like parallelising operations that only take a few seconds results in worse performance while parallelising operations that take longer, say a minute or more, is probably a good idea. As always, you’ll need to test to see what benefits you get for your Dataflows.

These results raise a lot of questions too. 100 seconds = 1 minute 40 seconds, which is a lot less than 3 minutes 14 seconds. Does this mean that not every row in the table was evaluated in parallel? Is partitioning on the RowNumber column counter-productive and would it be better to partition in some other way to try to reduce the amount of attempted parallelism? Is there something else that is limiting the amount of parallelism? While this version of the Dataflow always performs better than the non-partitioned version, performance did vary a lot between refreshes. While these tests show how useful Partitioned Compute can be for slow Dataflows, there’s a lot more research to do and a lot more blog posts to write.

A Closer Look At Preview-Only Steps In Fabric Dataflows

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.
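For example, a preview-only row filter might look something like the following sketch, where the source table is a self-contained stand-in for a real (and much slower) query:

```
let
    Source = #table(
        type table [RowNumber = number],
        List.Transform({1 .. 100000}, each {_})
    ),
    // Right-click this step and choose "Enable only in previews":
    // the editor then works with just 1000 rows, but a refresh returns them all
    KeepTopRows = Table.FirstN(Source, 1000)
in
    KeepTopRows
```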

The more I thought about this feature, though, the more I wondered about how it actually works and how it can be used in more complex queries that involve hand-written M code – so I decided to do a few tests. First of all, consider the following M query:

let
Step1 = 1,
Step2 = Step1 - 1
in
Step2

This query returns a numeric value: 0.

At this point it will return the same value in the editor when viewing the preview of the output and when the Dataflow refreshes. But if you right-click on Step2 and select “Enable only in previews” then the query still returns 0 in the preview but will return 1 when the Dataflow refreshes. You will also see this message displayed in the preview pane:

This data preview uses preview-only steps.
Results may differ when running the dataflow.

This makes sense because the query follows a linear pattern: the output references Step2 which in turn references Step1, so if you disable Step2 then the output simply skips it and returns the value of Step1.
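In other words, when the Dataflow refreshes with Step2 marked as preview-only, the query behaves as if it had been written like this:

```
let
    Step1 = 1
in
    Step1
```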

But what if your M code is more complex? For example consider this query:

let
x = 5,
y = 8,
xtimesy = x*y,
outputtable = #table(type table[xy=number],{{xtimesy}})
in
outputtable

Here’s what this returns in the editor: a table that contains the value 40.

What if you disable the step called “y”?

Here’s what I saw in my Warehouse when I loaded the output of the query to it:

I had no idea what the output would be before I saw it but, thinking about it, I assume what has happened is that since the step “y” has been disabled, “y” simply returns the value of the previous step in the query, “x”, and therefore the output becomes 5*5=25. This understanding of how preview-only steps work is backed up by the fact that it is not possible to set step “x” to be preview-only. The option to do so is greyed out:
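If that interpretation is right then, when the Dataflow refreshes, the query effectively behaves like this (my reading of the behaviour, not something the editor shows you):

```
let
    x = 5,
    // "y" is disabled, so it evaluates to the previous step, x
    y = x,
    xtimesy = x * y,
    outputtable = #table(type table [xy = number], {{xtimesy}})
in
    outputtable
```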

This all leads on to a pattern I used in a Dataflow this week and which I can see myself using a lot more in the future. While being able to disable steps when the Dataflow refreshes is very useful, sometimes what you need is to write some conditional logic in your M code that does one thing if you’re in the editor and another thing if the Dataflow is refreshing. Here’s a slightly modified version of the first query above which I have called IsRefresh in my Dataflow:

let
RefreshValue = 1,
PreviewValue = RefreshValue-1
in
PreviewValue

With the PreviewValue step set to be preview-only:

…then what we have here is a query that returns 1 when the Dataflow is refreshing and 0 when you’re viewing its output in the editor. Here’s an example of how it can be used in a query:

let
    Source = #table(
        type table [
            NumericValue = number,
            TextValue = text
        ],
        {
            {
                IsRefresh,
                if IsRefresh = 1 then
                    "This is a refresh"
                else
                    "This is a preview"
            }
        }
    )
in
    Source

This query returns the following when you preview the output in the editor:

But when the Dataflow refreshes, the output in the destination is this:

Generating Sample Data With Copilot In Fabric Dataflows Gen2

As a Power Query fan I’m naturally interested in Copilot in Dataflows Gen2. But since Power Query is already very easy to use and since I’m pretty experienced at using it, can Copilot really do anything to help me be more productive? Well one thing I need to do a lot for blog posts, demos and testing is to generate tables of sample data. Typing data into the Enter Data source can be quite tedious and writing the M code to generate sample data isn’t straightforward, so I was wondering if Copilot in Dataflows Gen2 could do it for me. It turns out it can do so really well.

After a little bit of experimentation I’ve come up with an example prompt that can be easily modified to return a table of random values with columns of different data types and a given number of rows, and which works fairly reliably:

Return a table consisting of the following five columns: 
Country, containing one of 10 random country names; 
Product, containing one of 15 random fruit names; 
Person, containing one of 20 random first names that are common in the United Kingdom;
Date, containing random dates between January 1st 2019 and December 31st 2024;
Sales, containing random integers between 0 and 500. 
The table should have 1000 rows. Set an appropriate data type for each column.

Here’s an example of the M code generated:

let
    // Generate lists of random values
    Countries = {"United States", "Canada", "United Kingdom", "Germany", "France", "Italy", "Spain", "Australia", "India", "China"},
    Products = {"Apple", "Banana", "Cherry", "Date", "Elderberry", "Fig", "Grape", "Honeydew", "Kiwi", "Lemon", "Mango", "Nectarine", "Orange", "Papaya", "Quince"},
    People = {"James", "Mary", "John", "Patricia", "Robert", "Jennifer", "Michael", "Linda", "William", "Elizabeth", "David", "Barbara", "Richard", "Susan", "Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen"},
    StartDate = #date(2019, 1, 1),
    EndDate = #date(2024, 12, 31),
    DaysBetween = Duration.Days(EndDate - StartDate) + 1,

    // Function to generate random dates
    RandomDate = () => Date.From(StartDate + #duration(Number.RoundDown(Number.RandomBetween(0, DaysBetween)), 0, 0, 0)),

    // Generate table records
    Records = List.Transform({1..1000}, each [
        Country = Countries{Number.Mod(_, List.Count(Countries))},
        Product = Products{Number.Mod(_, List.Count(Products))},
        Person = People{Number.Mod(_, List.Count(People))},
        Date = RandomDate(),
        Sales = Number.RoundDown(Number.RandomBetween(0, 500))
    ]),

    // Convert list of records to table and set data types
    Source = Table.FromRecords(Records),
    TypedTable = Table.TransformColumnTypes(Source, {
        {"Country", type text},
        {"Product", type text},
        {"Person", type text},
        {"Date", type date},
        {"Sales", Int32.Type}
    })
in
    TypedTable

And here’s an example of the output:

Definitely a time saver as far as I’m concerned. Is it totally reliable? No: it occasionally produces code that errors or which doesn’t contain genuinely random values, but it’s good enough and it’s faster to try the prompt once or twice than write the code myself. I know there are other, more sophisticated ways of generating sample data like this in Fabric, for example in Python, but as I said I’m a Power Query person.

And of course, for bonus points, we can now send the output of a Dataflow Gen2 to a CSV file in SharePoint which makes this even more useful:

Speed Up Dataflow Publishing/ Validation Times In Power BI And Fabric

If you’re working with slow data sources in Power BI/Fabric dataflows then you’re probably aware that validating (for Gen1 dataflows) or publishing (for Gen2 dataflows) them can sometimes take a long time. If you’re working with very slow data sources then you may run into the 10 minute timeout on validation/publishing that is documented here. For a Gen1 dataflow you’ll see the following error message if you try to save your dataflow and validation takes more than 10 minutes:

Failed to analyze issues in the query

For a Gen2 Dataflow, where you can save the Dataflow and publishing takes place in the background, you’ll see the following error in your workspace:

Dataflow publish failed

Apart from tuning your data source and tuning your queries, what can you do about this? Well, one of the things that happens when you publish a dataflow is that it works out the columns returned, and the data types of those columns, for all of the queries in the dataflow. It does this by applying a top 0 row filter to the queries and running them until they return data; if you can make that faster then validation/publishing will be faster. Obviously query folding is important here because that top 0 filter should fold, as are more obscure, source-specific settings like this one for ODBC sources. However, there is another trick that you can use if you are happy writing some moderately complicated M code – the trick I blogged about here for making Power Query in Power BI Desktop faster.

Let’s see an example with Dataflows Gen2. Consider the following M code, which returns a table with three columns and is deliberately written to take 11 minutes and 1 second to return (see this post for more details on how to create artificially slow Power Query queries):

let
    Source = Function.InvokeAfter(
        () =>
            #table(
                type table [
                    #"Number Column" = number,
                    #"Text Column" = text,
                    #"Date Column" = date
                ],
                {
                    {1, "Hello", #date(2016, 1, 1)},
                    {2, "World", #date(2017, 12, 12)}
                }
            ),
        #duration(0, 0, 11, 1)
    )
in
    Source

As you would expect, trying to publish a Gen1 or Gen2 dataflow that uses this query will fail because it takes more than 10 minutes before it returns any rows. However in this case – as in most cases – you know what columns the query returns so it’s possible to use the Table.View M function to intercept the zero-row filter applied during validation/publishing and return a table with no rows in and the columns that the query above returns. You can do this by adding two extra steps in the M code like so:

let
    Source = Function.InvokeAfter(
        () =>
            #table(
                type table [
                    #"Number Column" = number,
                    #"Text Column" = text,
                    #"Date Column" = date
                ],
                {
                    {1, "Hello", #date(2016, 1, 1)},
                    {2, "World", #date(2017, 12, 12)}
                }
            ),
        #duration(0, 0, 11, 1)
    ),
    TableTypeToReturn =
        type table [
            #"Number Column" = number,
            #"Text Column" = text,
            #"Date Column" = date
        ],
    OverrideZeroRowFilter = Table.View(
        null,
        [
            GetType = () => TableTypeToReturn,
            GetRows = () => Source,
            OnTake = (count as number) =>
                if count = 0 then
                    #table(TableTypeToReturn, {})
                else
                    Table.FirstN(Source, count)
        ]
    )
in
    OverrideZeroRowFilter

The first step added here, called TableTypeToReturn, defines the columns and data types of the table returned by the query; if you use this technique yourself, you will need to alter it so it returns the columns and data types of your query. You can read more about #table and table types here and I have a function that will automatically generate this code from an existing query for you here. The second step, called OverrideZeroRowFilter, looks for situations where a Top N filter is being applied and if N=0 returns a table of the type defined in the previous step with zero rows. For a more detailed explanation see that original blog post.

This new version of the query validates/publishes immediately, although it still takes 11 minutes and 1 second to refresh. Of course if you use this technique and then change your query so that different columns or data types are returned you have to update the extra code every time, which can be fiddly, but if you’re running into a timeout then you don’t have any choice and even if validation/publishing is slow it’s probably worth the extra effort.

Fabric Dataflows Gen2: To Stage Or Not To Stage?

If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not – where “staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output to be written to. Turning this on or off can have a big impact on your refresh times, making them a lot faster or a lot slower. You can find this property by right-clicking on the query name in the Queries pane:

At the moment this property is on by default for every query although this may change in the future. But should you turn it on for the queries in your Gen2 dataflows? It depends, and you should test to see what gives you the best performance.

Let’s see a simple example. I uploaded a CSV file from my favourite data source, the Land Registry price paid data, with about a million rows in it to the files section of a Lakehouse, then created a query that did a group by on one of the columns to find the number of property transactions by each county in England and Wales. The query was set to load its output to a table in a Warehouse.
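Something like the following M sketch, where the connector and the County column name are assumptions (in my Dataflow the file was read from the Lakehouse Files section, so the local-file connector shown here is just a stand-in):

```
let
    Source = Csv.Document(
        File.Contents("pp-complete.csv"),  // hypothetical: Land Registry price paid CSV
        [Delimiter = ","]
    ),
    PromotedHeaders = Table.PromoteHeaders(Source),
    CountByCounty = Table.Group(
        PromotedHeaders,
        {"County"},
        {{"Transactions", each Table.RowCount(_), Int64.Type}}
    )
in
    CountByCounty
```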

Here’s the diagram view for this query:

I then made sure that staging was turned off for this query:

This means that the Power Query engine did the group by itself as it read the data from the file.

Looking at the refresh history for this dataflow:

…showed that the query took between 18 and 24 seconds to run. Clicking on an individual refresh to see the details:

…showed a single activity to load the output to the Warehouse. Clicking on this activity to see more details:

…showed how long it took – 15 seconds – plus how many rows were loaded to the destination Warehouse and how much data.

I then created a second dataflow to see the effect of staging. It’s important to understand that copying the previous dataflow and enabling staging on the only query in it does not do what I wanted here. Instead I had to create two queries: one with staging enabled and no destination set (called PP here), which stages all the raw data from the CSV file; and a second one with staging disabled (called Counties here), which references the first, does the group by, and has its destination set to the Warehouse I used in the previous dataflow.
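A sketch of what the Counties query might look like – it just references PP and does the group by (the County column name is an assumption):

```
let
    Source = PP,
    Counties = Table.Group(
        Source,
        {"County"},
        {{"Transactions", each Table.RowCount(_), Int64.Type}}
    )
in
    Counties
```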

Here’s the diagram view for these two queries:

Note the blue outline on the PP query which indicates that it’s staged and the grey outline on the Counties query that indicates that it is not staged.

Looking at the Refresh History for this dataflow showed that it took around 40 seconds to run on average:

Looking at the first level of detail for the last refresh showed the extra activity for staging the data:

Clicking on the details for this staging activity for the PP table showed that it took 17 seconds to load all the raw data:

The activity to write the data to the Warehouse took about the same as with the first dataflow:

In summary, the first dataflow clearly performs better than the second dataflow. In this case, therefore, it looks like the overhead of staging the data made the performance worse.

Don’t take this simple example as proof of a general rule: every dataflow will be different and there are a lot of performance optimisations planned for Dataflows Gen2 over the next few months, so you should test the impact of staging for yourself. I can imagine that for different data sources (a Lakehouse source is likely to perform very well, even for files) and different transformations, staging will have a positive impact. On the other hand if you’re struggling with Dataflows Gen2 performance, especially at the time of writing this post, turning off staging could lead to a performance improvement.

Power Query Templates In Excel And Fabric

Earlier this year a new feature was added to Power Query in Excel and Dataflows that very few people noticed: the ability to export all the queries in the Power Query Editor to a template file. The reason few people noticed it was that, when the feature was released, the only place you could import a template file was in a Power Platform dataflow. The documentation for exporting from Excel and importing to a Power Platform dataflow is very detailed and can be found here:

https://learn.microsoft.com/en-us/power-query/new-dataflow-from-template

With the release of Fabric you can now import templates into Gen2 Dataflows. This means you can export a template from a Gen1 dataflow and import it into a Gen2 dataflow, which is documented here:

https://learn.microsoft.com/en-us/fabric/data-factory/move-dataflow-gen1-to-dataflow-gen2

Joining the dots, this also means you can now export a template from Power Query in Excel and import it to a Gen2 dataflow. This makes it easy to develop on your PC before pushing your work up to Fabric. Here’s a quick walkthrough of how to do this.

In Excel Power Query let’s say you have one or more queries in the Power Query Editor:

If you then go to the File menu in the top left corner of the screen you’ll see the option to export to a template:

If you click this menu item, you’ll see a dialog where you can enter a name and description for the template:

Click Export and you can save the template to a .pqt file.

If you then create a new Gen2 Dataflow in Fabric then, in the middle of the screen, you’ll see the option to import from a template:

[Notice also the Export Template option in the top right corner]

If you import the .pqt file created in Excel the queries in it will get created in the dataflow. You may need to re-enter credentials:

Once you’ve done that, the queries run:

Template files are the way the cool kids share their Power Query queries these days and no kid is cooler than my colleague Alex Powers who, I saw on Twitter recently, has created a template file that scrapes all the ideas from the Fabric Ideas site:

[If you do import this template close to the time of writing this post you may run into this bug from the list of known issues for Fabric Data Factory and Dataflows]

Hopefully we’ll start to see more and more useful Power Query queries and functions shared as template files!

Monitoring Power Query Online Memory And CPU Usage

Power Query Online is, as the name suggests, the online version of Power Query – it’s what you use when you’re developing Power BI Dataflows for example. Sometimes when you’re building a complex, slow query in the Query Editor you’ll notice a message in the status bar at the bottom of the page telling you how long the query has been running for and how much memory and CPU it’s using:

The duration and CPU values are straightforward, but what does the memory value actually represent? It turns out it’s the “Commit (Bytes)” value documented here for Query Diagnostics, that’s to say the amount of virtual memory being used by the query. That’s different to the “Working Set (Bytes)” value which is the amount of physical memory used by the query, and which is not visible anywhere. For a more detailed discussion of these values in Power Query in Power BI Desktop see this post. The maximum commit or working set for a query evaluation in Power Query Online isn’t officially documented anywhere (and may change) but I can say three things:

  1. The maximum commit is larger than the maximum working set.
  2. If Power Query Online uses more than the maximum working set then query evaluation will get slow, so if your query uses a lot of memory (say, over 1GB – I suspect you’ll only see this message if it is using a lot of memory…) then you need to do some tuning to reduce it. Probably the best way to do this is to look at the query plan for your dataflow and try to avoid any operations marked as “Full Scan”, as documented here.
  3. If your query uses more than the maximum commit then it may get cancelled and you’ll see an error (note that the maximum time a query evaluation can run for in Power Query Online anyway is 10 minutes, which is documented here).

[Thanks to Jorge Gomez Basanta for this information]

Three New Power BI/Power Query Books

I decided to stop writing book reviews here on my blog a long time ago: it’s a lot of work to read a book and write a proper, detailed review and what’s more I don’t like the idea of writing a bad review and upsetting someone who has gone to all the effort of writing a book. That said, from time to time I get given free copies of books (which I’m always happy to receive – I like to see how other people go about explaining Power BI concepts and functionality) and in return I give the authors some free publicity here. Recently I received two copies of new books from people that I know:

Expert data modeling with Power BI, by Soheil Bakhshi (Buy it here on Amazon UK)

Soheil is an MVP whose blog I have read and admired for some time so I’m pleased to see he has written a book. It’s an important subject too: good data modelling is key to success with Power BI, and the problems of many customers I work with stem from not taking the time to learn how data should be modelled for Power BI. This book introduces you to concepts like dimensional modelling and star schemas and shows you how to build datasets that follow best practices. It also covers topics such as calculation groups and object-level security that won’t be in older books.

Power Query cookbook, by Andrea Janicijevic (Buy it here on Amazon UK)

Andrea is a colleague of mine at Microsoft and of course Power Query is a technology close to my heart. This book follows the cookbook format which teaches through a series of worked examples and easy-to-follow steps; anyone learning Power Query will find it useful to follow these recipes to get practice creating queries. I liked the inclusion of Power BI Dataflows as well as Power Query in Power BI Desktop, and again this book has the advantage of being new – it covers recently-added features such as Schema View and Diagram View in Dataflows and Query Diagnostics in Power BI Desktop that won’t be covered in other books.

There’s another book I was curious about and was lucky enough to be able to read via Microsoft’s online library for employees:

Pro Power BI theme creation, by Adam Aspin (Buy it here on Amazon UK)

When I heard someone had written a book about Power BI theme files I couldn’t believe it, but Adam is an experienced writer and has pulled it off. As you might expect it’s everything you ever wanted to know about Power BI themes and as such, if themes are something you’re interested in, you should read this book. It explains how theme files are structured, how to edit them and how the various attributes are applied to different visuals.

How Query Folding And The New Power BI Dataflows Connector Can Help Dataset Refresh Performance

You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I’d take a look at how much of an improvement this can make to dataset refresh performance.

For my tests I created a new PPU workspace and a dataflow, and made sure the Enhanced Compute Engine was turned on for the dataflow on the Settings page:

Query folding will only happen if the Enhanced Compute Engine is set to “On”, and won’t happen with the “Optimized” setting. The Enhanced Compute Engine is only available with PPU and Premium.

For my data source I used a CSV file with a million rows and seven integer columns. I then created two tables in my dataflow like so:

The Source table simply connects to the CSV file, uses the first row as the headers, then sets the data type on each column. The second table called Output – which contains no transformations at all – is needed for the data to be stored in the Enhanced Compute Engine, and the lightning icon in the top-left corner of the table in the diagram shows this is the case.
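For illustration, the M for the Source table would look something like the following. This is a sketch based on the description above: the file path is a placeholder, and I’m assuming the seven integer columns are named A to G (the post later filters on a column called A):

let
    // Read the CSV file - the path here is a placeholder
    Csv = Csv.Document(File.Contents("C:\Data\MillionRows.csv"), [Delimiter = ",", Encoding = 65001]),
    // Use the first row as the column headers
    #"Promoted Headers" = Table.PromoteHeaders(Csv, [PromoteAllScalars = true]),
    // Set an integer data type on each of the seven columns (names assumed)
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {{"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type},
         {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type}})
in
    #"Changed Type"

The Output table would then simply reference Source with no further transformations, which is what allows its data to be cached in the Enhanced Compute Engine.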

Next, in Power BI Desktop, I created a Power Query query that used the old Power BI dataflows connector:

If you have any existing datasets that connect to dataflows, this is the connector you will have used – it is based on the PowerBI.Dataflows function. My query connected to the Output table and filtered it down to the rows where the value in column A is less than 100. Here’s the M code, slightly edited to remove all the ugly GUIDs:

let
    Source = PowerBI.Dataflows(null),
    ws = Source{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output1 = df{[entity="Output"]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output1, each [A] < 100)
in
    #"Filtered Rows"

Remember, this connector does not support query folding. Using this technique to measure how long the query took when its results were loaded into the dataset, I could see it took almost 12.5 seconds to get the data:

In fact performance in Desktop was even worse than this number suggests: while the refresh was taking place, I could see Power BI downloading 108MB of data even though the original source file is only 54MB.

Why was the data downloaded twice? I strongly suspect it’s because of this issue – because, of course, no query folding is happening.

I then created the same query with the new dataflows connector:

This connector uses the PowerPlatform.Dataflows function; it’s not new, but what is new is that you can now access Power BI dataflows using it.

Here’s the M code, again cleaned up to remove GUIDs:

let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id="Workspaces"]}[Data],
    ws = Workspaces{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output_ = df{[entity="Output",version=""]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output_, each [A] < 100)
in
    #"Filtered Rows"

When this query was loaded into the dataset, it only took 4 seconds:

This is a lot faster, and Power BI Desktop was a lot more responsive during development too.

It’s reasonable to assume that query folding is happening in this query and the filter on [A]<100 is now taking place inside the Enhanced Compute Engine rather than in Power BI Desktop. But how can you be sure query folding is happening? The “View Native Query” option is greyed out, but of course this does not mean that query folding is not happening. However, if you use Query Diagnostics, hidden away in the Data Source Query column of the detailed diagnostics query, you can see a SQL query with the WHERE clause you would expect:
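I can’t reproduce the screenshot here, but the statement captured in the Data Source Query column was a SQL query along the following lines. This is an illustrative sketch rather than the exact query generated – the table and column names are simplified – but the important point is the presence of the WHERE clause, which proves the filter was folded to the Enhanced Compute Engine rather than being applied by Power Query:

select [A], [B], [C], [D], [E], [F], [G]
from [Output]
where [A] < 100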

In conclusion, you can see that the new dataflows connector can give you some big improvements in dataset refresh performance and a much better development experience in Power BI Desktop. Query folding support also means that you can now use dataset incremental refresh when using a dataflow as a source. However, you will need to use Premium or PPU; you may also need to make some changes to your dataflow to make sure it can take advantage of the Enhanced Compute Engine; and you will need to update any existing Power Query queries to use the new connector. I think the potential performance gains are worth making these changes though. If you do make these changes in your dataflows and find that it helps, please leave a comment!