How Query Folding And The New Power BI Dataflows Connector Can Help Dataset Refresh Performance

You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I’d take a look at how much of an improvement this can make to dataset refresh performance.

For my tests I created a new PPU workspace and a dataflow, and made sure the Enhanced Compute Engine was turned on for the dataflow on the Settings page:

Query folding will only happen if the Enhanced Compute Engine is set to “On”, and won’t happen with the “Optimized” setting. The Enhanced Compute Engine is only available with PPU and Premium.

For my data source I used a CSV file with a million rows and seven integer columns. I then created two tables in my dataflow like so:

The Source table simply connects to the CSV file, uses the first row as the headers, then sets the data type on each column. The second table, called Output – which contains no transformations at all – is needed for the data to be stored in the Enhanced Compute Engine; the lightning icon in the top-left corner of the table in the diagram shows this is the case.
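For reference, the M behind these two dataflow queries might look something like the following sketch – the file path is an assumption, and only the name of column A is confirmed by the filter used later in this post; the names B to G are made up:

```m
// Source: connect to the CSV, promote the first row to headers,
// then set every column to a whole number type
// (the file path and column names B-G are assumptions)
let
    Csv = Csv.Document(File.Contents("C:\Data\MillionRows.csv"), [Delimiter = ",", Encoding = 65001]),
    PromotedHeaders = Table.PromoteHeaders(Csv, [PromoteAllScalars = true]),
    ChangedType = Table.TransformColumnTypes(
        PromotedHeaders,
        {{"A", Int64.Type}, {"B", Int64.Type}, {"C", Int64.Type}, {"D", Int64.Type},
         {"E", Int64.Type}, {"F", Int64.Type}, {"G", Int64.Type}}
    )
in
    ChangedType
```

The Output query is then nothing more than a reference to Source with no extra steps, which is what allows its data to be cached by the Enhanced Compute Engine.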

Next, in Power BI Desktop, I created a Power Query query that used the old Power BI dataflows connector:

If you have any existing datasets that connect to dataflows, this is the connector you will have used – it is based on the PowerBI.Dataflows function. My query connected to the Output table and filtered the rows to where column A is less than 100. Here’s the M code, slightly edited to remove all the ugly GUIDs:

let
    Source = PowerBI.Dataflows(null),
    ws = Source{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output1 = df{[entity="Output"]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output1, each [A] < 100)
in
    #"Filtered Rows"

Remember, this connector does not support query folding. Using this technique to measure how long the query ran when the results from the query were loaded into the dataset, I could see it took almost 12.5 seconds to get the data for this query:

In fact the performance in Power BI Desktop was even worse: while the refresh was taking place, I could see Power BI downloading 108MB of data even though the original source file is only 54MB.

Why is the data downloaded twice? I strongly suspect it’s because of this issue – because, of course, no query folding is happening – which makes the experience in Desktop worse still.

I then created the same query with the new dataflows connector:

This connector uses the PowerPlatform.Dataflows function; it’s not new, but what is new is that you can now access Power BI dataflows using it.

Here’s the M code, again cleaned up to remove GUIDs:

let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id="Workspaces"]}[Data],
    ws = Workspaces{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output_ = df{[entity="Output",version=""]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output_, each [A] < 100)
in
    #"Filtered Rows"

When this query was loaded into the dataset, it only took 4 seconds:

This is a lot faster, and Power BI Desktop was a lot more responsive during development too.

It’s reasonable to assume that query folding is happening here and that the filter on [A]<100 now takes place inside the Enhanced Compute Engine rather than in Power BI Desktop. But how can you be sure? The “View Native Query” option is greyed out, but of course that does not prove query folding is not happening. If you run Query Diagnostics, however, you can see – hidden away in the Data Source Query column of the detailed diagnostics query – a SQL query with the WHERE clause you would expect:
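To give a sense of what to look for, the folded query captured in the Data Source Query column looks something like the following – this is an illustration based on the example in this post, not the exact SQL generated by the Enhanced Compute Engine:

```sql
-- Hypothetical folded query; the entity name and column names
-- are assumed from the example in this post
SELECT [A], [B], [C], [D], [E], [F], [G]
FROM [Output]
WHERE [A] < 100
```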

In conclusion, the new dataflows connector can give you some big improvements in dataset refresh performance and a much better development experience in Power BI Desktop. Query folding support also means that you can now use dataset incremental refresh when using a dataflow as a source. However, you will need Premium or PPU, you may need to make some changes to your dataflow to make sure it can take advantage of the Enhanced Compute Engine, and you will need to update any existing Power Query queries to use the new connector. I think the potential performance gains are worth these changes though. If you do make these changes in your dataflows and find that they help, please leave a comment!
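As a sketch of that incremental refresh scenario, the query below filters the dataflow on the RangeStart and RangeEnd parameters that Power BI requires for incremental refresh. The [Date] column is an assumption – the example table in this post only has integer columns – and the filter needs to fold to the Enhanced Compute Engine for the refresh to be efficient:

```m
let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    ws = Workspaces{[workspaceId = "xxxx"]}[Data],
    df = ws{[dataflowId = "yyyy"]}[Data],
    Output_ = df{[entity = "Output", version = ""]}[Data],
    // RangeStart and RangeEnd are the reserved datetime parameters
    // used by dataset incremental refresh; [Date] is an assumed column
    #"Filtered Rows" = Table.SelectRows(Output_, each [Date] >= RangeStart and [Date] < RangeEnd)
in
    #"Filtered Rows"
```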

14 responses

  1. Pingback: Improving Dataset Refresh with Query Folding and the Dataflows Connector – Curated SQL

  2. hi Chris, thanks for sharing your knowledge,

    Today some companies face a big issue with Premium capacity overhead and utilization.

    A large part of the impact comes from refresh times and from many datasets refreshing at the same time.

    As far as I know, a normal dataflow doesn’t use memory on the capacity over time. If I leave many dataflows with the Enhanced Compute Engine enabled, is there constant memory consumption on the capacity even when no dataset refresh is using them as a source (not considering DirectQuery to dataflows)?

  3. Thanks for explaining the advantages of the new connector, Chris! Great stuff as always. I’m curious, is updating a dataset that’s using the older connector just as simple as replacing “PowerBI.Dataflows” in the Source section with “PowerPlatform.Dataflows”, then close and apply?… or is the process different than that? Cheers!

  4. Hi Chris,
    It’s nice to have query folding available for dataflows, but dataset incremental refresh does not work when some of the old existing records get updated.

  5. On my side, swapping the legacy connector with this one has some limitations:
    1. The connection from PBI Desktop is super slow;
    2. Loading Date-only fields from a dataflow converts them to a DateTime field, forcing you to run an extra datatype conversion (I don’t remember though if such conversions break query folding)

    For now, I have reverted to the legacy connector and will consider migrating again once the code base is stable enough.

    • Okay, we’ve found a performance problem mostly limited to the editor where the “top N” operator isn’t being folded to the compute back-end. This can result in a significantly worse editing experience. The fix will be in the October release, and we’re going to try to port it to the September release as a QFE as well.

  6. Hi Chris,

    Well, I noticed at least one “regression” compared to the legacy PBI dataflow connector: Date-only values are imported as DateTime (i.e. they are all suffixed with 12:00:00 AM).

    While adding an extra data type conversion at the M level within PBI desktop solves it, what’s the point? Also, I don’t remember if such conversion breaks folding or not.

    I am therefore reverting all my dataflows to the legacy connector, which does not have this issue. I will wait for a more stable code base.

    • We should be able to fix this for the October release. That said, if you’re just loading the data into some destination — whether a data model or another data flow — it should load correctly as a date because the column type is correct and loading is driven by the column type and not the cell types.

    • Does the new connector show loading progress after clicking Close & Apply (i.e. how many rows or MB have been loaded)?

  7. Pingback: Connecting to a Dataflow by Name - Greyskull Analytics

  8. I agree with Fred S., the new connector is horribly slow. In my test case with about 1.2 million rows and about 50 columns it’s slow to the point of being unusable. Even the preview in Power Query takes ages to complete.

  9. Thanks for the article.

    I’ve experienced similar issues to some of the commenters: loading data with the new dataflows connector in Power BI Desktop does not even complete for me.

    I have about 20 million rows, and after 20 minutes of attempting to connect via PowerPlatform.Dataflows I receive “An error happened while reading data from the provider: ‘Cannot write more bytes to the buffer than the configured maximum buffer size. 2147483647’”. I attempted restricting the data with RangeStart and RangeEnd parameters, but even one day of data times out.

    Any help is much appreciated. Thank you

  10. The error “Cannot write more bytes to the buffer than the configured maximum buffer size. 2147483647” is a known issue in the August release of Power BI Desktop and the fix will be available in the September release.
