How Query Folding And The New Power BI Dataflows Connector Can Help Dataset Refresh Performance

You may have noticed that a new dataflows connector was announced in the August 2021 release of Power BI Desktop, and that it now supports query folding between a dataset and a dataflow – which you may be surprised to learn was not possible before. In this post I thought I’d take a look at how much of an improvement in performance this can make to dataset refresh performance.

For my tests I created a new PPU workspace and a dataflow, and made sure the Enhanced Compute Engine was turned on for the dataflow on the Settings page:

Query folding will only happen if the Enhanced Compute Engine is set to “On”, and won’t happen with the “Optimized” setting. The Enhanced Compute Engine is only available with PPU and Premium.

For my data source I used a CSV file with a million rows in and seven integer columns. I then created two tables in my dataflow like so:

The Source table simply connects to the CSV file, uses the first row as the headers, then sets the data type on each column. The second table called Output – which contains no tranformations at all – is needed for the data to be stored in the Enhanced Compute Engine, and the lightning icon in the top-left corner of the table in the diagram shows this is the case.

Next, in Power BI Desktop, I created a Power Query query that used the old Power BI dataflows connector:

If you have any existing datasets that connect to dataflows, this is the connector you will have used – it is based on the PowerBI.Dataflows function. My query connected to the Output table and filtered the rows to where column A is less than 100. Here’s the M code, slightly edited to remove all the ugly GUIDs:

let
    Source = PowerBI.Dataflows(null),
    ws = Source{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output1 = df{[entity="Output"]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output1, each [A] < 100)
in
    #"Filtered Rows"

Remember, this connector does not support query folding. Using this technique to measure how long the query ran when the results from the query were loaded into the dataset, I could see it took almost 12.5 seconds to get the data for this query:

In fact the performance in Desktop is worse: when refresh was taking place, I could see Power BI downloading 108MB of data even though the original source file is only 54MB.

Why is the data downloaded twice? I strongly suspect it’s because of this issue – because, of course, no query folding is happening. So the performance in Desktop is really even worse.

I then created the same query with the new dataflows connector:

This connector uses the PowerPlatform.Dataflows function; it’s not new, but what is new is that you can now access Power BI dataflows using it.

Here’s the M code, again cleaned up to remove GUIDS:

let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id="Workspaces"]}[Data],
    ws = Workspaces{[workspaceId="xxxx"]}[Data],
    df = ws{[dataflowId="yyyy"]}[Data],
    Output_ = df{[entity="Output",version=""]}[Data],
    #"Filtered Rows" = Table.SelectRows(Output_, each [A] < 100)
in
    #"Filtered Rows"

When this query was loaded into the dataset, it only took 4 seconds:

This is a lot faster, and Power BI Desktop was a lot more responsive during development too.

It’s reasonable to assume that query folding is happening in this query and the filter on [A]<100 is now taking place inside the Enhanced Compute Engine rather than in Power BI Desktop. But how can you be sure query folding is happening? The “View Native Query” option is greyed out, but of course this does not mean that query folding is not happening. However, if you use Query Diagnostics, hidden away in the Data Source Query column of the detailed diagnostics query, you can see a SQL query with the WHERE clause you would expect:

In conclusion, you can see that the new dataflows connector can give you some big improvements for dataset refresh performance and a much better development experience in Power BI Desktop. Query folding support also means that you can now use dataset incremental refresh when using a dataflow as a source. However, you will need to use Premium or PPU, you may also need to make some changes to your dataflow to make sure it can take advantage of the Enhanced Compute Engine, and you will also need to update any existing Power Query queries to use the new connector. I think the potential performance gains are worth making these changes though. If you do make these changes in your dataflows and find that it helps, please leave a comment!

29 thoughts on “How Query Folding And The New Power BI Dataflows Connector Can Help Dataset Refresh Performance

  1. hi Chris, thanks for sharing your knowledge,

    Today some companies face a big issue regarding premium capacity due to overhead and utilization.

    A large part of the impact comes from refresh time and many refreshing datasets in the same time.

    as far as I know, a normal dataflow don’t use hot memory from capacity over time. When I leave many dataflows with ECE enabled is there a constant memory consumption of the capacity even if you don’t have a dataset refreshing that are using them as a source (don’t consider directquery to dataflows)?

  2. Thanks for explaining the advantages of the new connector, Chris! Great stuff as always. I’m curious, is updating a dataset that’s using the older connector just as simple as replacing “PowerBI.Dataflows” in the Source section with “PowerPlatform.Dataflows”, then close and apply?… or is the process different than that? Cheers!

  3. On my side, swapping the legacy connector with this one has some limitations:
    1. The connection from PBI Desktop is super slow;
    2. Loading Date-only fields from a dataflow converts them to a DateTime field, forcing you to run an extra datatype conversion (I don’t remember though if such conversions break query folding)

    To me, I reverted to legacy dataflows and will consider migrating again once the code base is stable enough.

    1. Okay, we’ve found a performance problem mostly limited to the editor where the “top N” operator isn’t being folded to the compute back-end. This can result in a significantly worse editing experience. The fix will be in the October release, and we’re going to try to port it to the September release as a QFE as well.

  4. Hi Chris,

    Well, I noticed at least one “regression” compared to the legacy PBI dataflow connector: Date-only records are imported as Datetime (i.e. records are all suffixed with 12:00:00 AM).

    While adding an extra data type conversion at the M level within PBI desktop solves it, what’s the point? Also, I don’t remember if such conversion breaks folding or not.

    I am therefore reverting all my dataflows to the legacy connector, which does not have this issue. I will wait for a more stable code base.

    1. We should be able to fix this for the October release. That said, if you’re just loading the data into some destination — whether a data model or another data flow — it should load correctly as a date because the column type is correct and loading is driven by the column type and not the cell types.

  5. I agree with Fred S., the new connector is horribly slow. In my test case with about 1.2 million rows and about 50 columns, it’s bad to the point of unusable. Even the preview in Power Query takes ages to complete.

  6. Thanks for the article.

    I’ve experienced similar issues to some of the commenters, where loading data with the new Dataflows connector via PowerBI desktop does not even complete for me.

    I have about 20 million rows and after 20 minutes of attempting to connect to PowerPlatformDataflows, I receive “An error happened while reading data from the provider: ‘Cannot write more bytes to the buffer than the configured maximum buffer size. 2147483647’. I attempted restricting the data with RangeStart and Range end parameters, but even one day of data times out.

    Any help is much appreciated. Thank you

  7. The error “Cannot write more bytes to the buffer than the configured maximum buffer size. 2147483647” is a known issue in the August release of Power BI Desktop and the fix will be available in the September release.

  8. FYI the Lineage View works correctly in Power BI workspace when using the Power BI Dataflows connector. When using the Power Platofrm Dataflows connector, lineage view stops working correctly…Please fix Microsfot!

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      Yes, this is a known issue unfortunately. We don’t have a timeline for the fix yet.

  9. Have you tried to use this new connector in pipelines ?

    For me, it doesn’t change the connection between stages.
    Same problem that the lineage ??

    It was ok with the older one.

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      I think this is another known issue with the new connector that we’re working on fixing.

  10. The connector now seems useable in Excel by starting a blank query and typing =PowerPlatform.Dataflows()

    However, it does not have any query folding. Anybody know when that will be resolved?

  11. Seems like neither PowerBI.Dataflows() or PowerPlatform.Dataflows() can achieve query folding. PowerPlatform.Dataflows() is indeed a lot faster, but using PowerPlatform.Dataflows() in the Power BI service shows that any navigation steps (like choosing a workspace or dataflow) results in the query not being able to be folded.

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      Don’t worry about the steps that choose a workspace or dataflow. I don’t even know if the step folding indicators work properly with this connector yet.

  12. Hi Chris, I’ve been experimenting with the dataflows connector with a large dataset (over 20million rows) to take advantage of query folding. Sometimes I get an error message part way through when importing the data. The error is:

    “Failed to save modifications to the server. Error returned: ‘OLE DB or ODBC error: [DataSource.Error] An error happened while reading data from the provider: ‘Internal error DirectQueryUnexpectedEndOfStream.’.’.

    With some queries where I’ve selected limited columns and narrower date ranges I’ve retried and it’s worked successfully. However, when I’ve tried to download a dataset with around 10 million rows and 30 columns it is failing every time. Are you familiar with this message from this connector and do you have an ideas how I can dig into what the problem is? I haven’t seen it with the standard Power BI dataflows connector before but we are keen to take advantage of the query folding.

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      This is a known issue that we’re working on fixing

      1. Thanks for the reply Chris, that’s useful to know it’s working on being fixed. Do you have any rough timelines on how soon it will be resolved?

      2. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
        Chris Webb says:

        I don’t, sorry

    2. Hi Chris! Thanks for the blog. I’m suffering the same ‘DirectQueryUnexpectedEndOfStream’ message that Paul mentioned. Even I was tracking it with a MS support team. Do you know any update about it? Or if there is some way to avoid it? I’ve detected that this issue occurs when a client (dataflow or power bi) tries to send “foldable” M steps. When there is nothing, the data refresh ok. Thanks!

  13. Great article.
    Now that dataflows has made it to Excel is there anyway to use queryfolding when joining a dataflow to a table of values entered by the user?

    We have a dataflow with about 1.5 million records, this is imported to Excel and merged to a table in the Excel file populated by the user (inner join). It works but is very slow as the full dataset has to be downloaded before the merge limits the results to the small amount the user needs.

    1. Hi Chris! Thanks for the blog. I’m suffering the same ‘DirectQueryUnexpectedEndOfStream’ message that Paul mentioned. Even I was tracking it with a MS support team. Do you know any update about it? Or if there is some way to avoid it? I’ve detected that this issue occurs when a client (dataflow or power bi) tries to send “foldable” M steps. When there is nothing, the data refresh ok. Thanks!

Leave a Reply to Fred S.Cancel reply