If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not. “Staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output. Turning this property on or off can have a big impact on your refresh times, making them a lot faster or a lot slower. You can find it by right-clicking on the query name in the Queries pane:
At the moment this property is on by default for every query, although this may change in the future. But should you leave it on for the queries in your Gen2 dataflows? It depends, and you should test to see what gives you the best performance.
Let’s look at a simple example. I uploaded a CSV file with about a million rows from my favourite data source, the Land Registry price paid data, to the Files section of a Lakehouse, then created a query that did a group by on one of the columns to find the number of property transactions in each county in England and Wales. The query was set to load its output to a table in a Warehouse.
Here’s the diagram view for this query:
I then made sure that staging was turned off for this query:
This means that the Power Query engine did the group by itself as it read the data from the file.
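As a rough sketch, the query might look something like this in M – note that the file path, CSV options and the “County” column name here are my assumptions for illustration, not taken from the actual dataflow:

```m
let
    // Read the price paid CSV from the Files section of the Lakehouse
    // (path and column name are assumptions)
    Source = Csv.Document(
        File.Contents("/lakehouse/default/Files/pp-data.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]
    ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // The group by: count the number of transactions per county
    GroupedRows = Table.Group(
        PromotedHeaders,
        {"County"},
        {{"Transactions", each Table.RowCount(_), Int64.Type}}
    )
in
    GroupedRows
```

With staging turned off, the engine can stream the file and evaluate this group by in a single pass before writing the small aggregated result to the Warehouse.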
Looking at the refresh history for this dataflow:
…showed that the query took between 18 and 24 seconds to run. Clicking on an individual refresh to see the details:
…showed a single activity to load the output to the Warehouse. Clicking on this activity to see more details:
…showed how long it took – 15 seconds – plus how many rows were loaded to the destination Warehouse and how much data.
I then created a second dataflow to see the effect of staging. It’s important to understand that copying the previous dataflow and enabling staging on its only query would not do what I wanted here. Instead I had to create two queries: one with staging enabled and no destination set (called PP here), which stages all the raw data from the CSV file, and a second one (called Counties here) with staging disabled and its destination set to the Warehouse used in the previous dataflow, which references the first query and does the group by.
Here’s the diagram view for these two queries:
Note the blue outline on the PP query which indicates that it’s staged and the grey outline on the Counties query that indicates that it is not staged.
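In M terms, the two queries might look something like this – again, the path and column names are assumptions for illustration:

```m
// Query 1: PP – staging enabled, no destination set.
// Loads all the raw data, which gets written to the staging Lakehouse.
let
    Source = Csv.Document(
        File.Contents("/lakehouse/default/Files/pp-data.csv"),
        [Delimiter = ",", Encoding = 65001]
    ),
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    PromotedHeaders

// Query 2: Counties – staging disabled, destination set to the Warehouse.
// References PP, so the group by runs against the staged copy of the data.
let
    Source = PP,
    GroupedRows = Table.Group(
        Source,
        {"County"},
        {{"Transactions", each Table.RowCount(_), Int64.Type}}
    )
in
    GroupedRows
```

Because Counties references a staged query rather than the original file, all of the raw data has to be written to the staging Lakehouse first before the group by can run.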
Looking at the Refresh History for this dataflow showed that it took around 40 seconds to run on average:
Looking at the first level of detail for the last refresh showed the extra activity for staging the data:
Clicking on the details for this staging activity for the PP table showed that it took 17 seconds to load all the raw data:
The activity to write the data to the Warehouse took about the same as with the first dataflow:
In summary, the first dataflow clearly performs better than the second dataflow. In this case, therefore, it looks like the overhead of staging the data made the performance worse.
Don’t take this simple example as proof of a general rule: every dataflow will be different, and there are a lot of performance optimisations planned for Dataflows Gen2 over the next few months, so you should test the impact of staging for yourself. I can imagine that for different data sources (a Lakehouse source is likely to perform very well, even for files) and different transformations, staging will have a positive impact. On the other hand, if you’re struggling with Dataflows Gen2 performance, especially at the time of writing this post, turning off staging could lead to a performance improvement.