Stop Query Folding With The Table.StopFolding M Function

[This post was originally published on the official Power Query blog, which has now been taken down. I’m republishing all my posts there to this blog to ensure the content remains available.]

Query folding is generally a good thing in Power Query: pushing transformations back to the data source means they almost always perform better. That’s not always the case though and sometimes you need to stop query folding taking place to improve performance. In the past there has been no easy way to do this: you could use the Table.Buffer M function but this also buffers an entire table into memory, which can lead to other, different performance problems; you could also add a transformation that you know doesn’t fold, such as adding an index column to your table, but again this could involve a performance penalty and what’s more if future versions of Power Query are able to fold the transformation then this approach will not work anymore. Now, however, there is a new M function called Table.StopFolding that is guaranteed to stop query folding taking place with no other side effects.

Here’s a simple example of how to use it. Consider the following M query which connects to SQL Server, gets data from the DimProductCategory table in the AdventureWorksDW2017 database and filters it so only the rows where the ProductCategoryKey column is greater than two are returned:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017"
  ), 
  dbo_DimProductCategory = Source
    {
      [
        Schema = "dbo", 
        Item   = "DimProductCategory"
      ]
    }
    [Data], 
  #"Filtered Rows" = Table.SelectRows(
    dbo_DimProductCategory, 
    each [ProductCategoryKey] > 2
  )
in
  #"Filtered Rows"

Here’s the SQL query generated by Power Query for this:

select [_].[ProductCategoryKey],
    [_].[ProductCategoryAlternateKey],
    [_].[EnglishProductCategoryName],
    [_].[SpanishProductCategoryName],
    [_].[FrenchProductCategoryName]
from [dbo].[DimProductCategory] as [_]
where [_].[ProductCategoryKey] > 2

As you can see, the filter has been folded and the WHERE clause of the SQL query contains the filter on ProductCategoryKey.

To stop query folding taking place for this filter transformation you can add an extra step to the M code of your query using the Table.StopFolding function like so:

let
  Source = Sql.Database(
    "localhost",
    "AdventureWorksDW2017"
  ),
  dbo_DimProductCategory = Source
    {
      [
        Schema = "dbo",
        Item   = "DimProductCategory"
      ]
    }
    [Data],
  PreventQueryFolding
    = Table.StopFolding(
    dbo_DimProductCategory
  ),
  #"Filtered Rows" = Table.SelectRows(
    PreventQueryFolding,
    each [ProductCategoryKey] > 2
  )
in
  #"Filtered Rows"

The SQL query generated is now as follows, with no WHERE clause:

select [$Ordered].[ProductCategoryKey],
    [$Ordered].[ProductCategoryAlternateKey],
    [$Ordered].[EnglishProductCategoryName],
    [$Ordered].[SpanishProductCategoryName],
    [$Ordered].[FrenchProductCategoryName]
from [dbo].[DimProductCategory] as [$Ordered]
order by [$Ordered].[ProductCategoryKey]

The query result is still the same but the filter is not longer being folded back to SQL Server. Instead all the data is being returned to Power Query and the filter is taking place there.

Understanding Power Query Query Diagnostics Data With The Diagnostics.ActivityID M Function

I was looking at the output of Power Query’s Query Diagnostics feature recently (again) and trying to understand it better. One of the more confusing aspects of it is the way that the Power Query engine may evaluate a query more than once during a single refresh. This is documented in the note halfway down this page, which says:

Power Query might perform evaluations that you may not have directly triggered. Some of these evaluations are performed in order to retrieve metadata so we can best optimize our queries or to provide a better user experience (such as retrieving the list of distinct values within a column that are displayed in the Filter Rows experience). Others might be related to how a connector handles parallel evaluations.

I came up with the following M query to illustrate this:

#table(
    type table 
    [#"Activity ID"=text], 
    {{Diagnostics.ActivityId()}}
    )

If you paste this code into a new blank query:

…you have a query that returns a table containing a single cell containing the text value returned by the Diagnostics.ActivityId M function, which I blogged about here. The output – copied from the Data pane of the main Power BI window – looks like this:

The Diagnostics.ActivityId function is interesting because it returns an identifier for the currently-running query evaluation, so in the table above the value in the Activity ID column is the identifier for the query that returned that table.

If you run a Query Diagnostics trace when refreshing this query, you’ll see that the Activity Id column of the Diagnostics_Detailed trace query contains evaluation identifier values:

The following query takes the output of a Diagnostics_Detailed trace query and gets just the unique values from the Id and Activity Id columns:

let
    Source = #"Diagnostics_Detailed_2022-04-24_19:40",
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Id", "Activity Id"}),
    #"Removed Duplicates" = Table.Distinct(#"Removed Other Columns")
in
    #"Removed Duplicates"

This makes it easy to see that my query was actually evaluated (or at least partially evaluated) three times when I clicked refresh. Since the value in the Activity Id column for Id 4.10 matches the value in the table loaded into my dataset, I know that that was the evaluation that loaded my table into the dataset.

New Options For The Table.Buffer Function In Power Query

The March 2022 release of Power BI Desktop includes some new options for the Table.Buffer M function that can be set in a new second parameter. The options are of type BufferMode.Type; if you look at the built-in documentation for this type you’ll see the following:

The two allowed values are:

BufferMode.Eager: The entire value is immediately buffered in memory before continuing
BufferMode.Delayed: The type of the value is computed immediately but its contents are not buffered until data is needed, at which point the entire value is immediately buffered

Anyone with previous experience of Table.Buffer will see that BufferMode.Eager is the existing behaviour, but what is BufferMode.Delayed for?

It turns out that it’s there to make development faster. Consider the following M query that loads data from a CSV file with seven columns and a million rows in, and then uses Table.Buffer to buffer that table into memory:

let
  Source = Csv.Document(
    File.Contents(
      "C:\NumbersMoreColumns.csv"
    ),
    [
      Delimiter  = ",",
      Columns    = 7,
      Encoding   = 65001,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers"
    = Table.PromoteHeaders(
    Source,
    [PromoteAllScalars = true]
  ),
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"A", Int64.Type},
      {"B", Int64.Type},
      {"C", Int64.Type},
      {"D", Int64.Type},
      {"E", Int64.Type},
      {"F", Int64.Type},
      {"G", Int64.Type}
    }
  ),
  BufferTable = Table.Buffer(
    #"Changed Type"
  )
in
  BufferTable

When I refresh this query in Power BI Desktop on my PC I see the “Evaluating…” message for 20 seconds before the data starts to load:

If, however, I add the second parameter [BufferMode = BufferMode.Delayed] to Table.Buffer like so:

let
  Source = Csv.Document(
    File.Contents(
      "C:\NumbersMoreColumns.csv"
    ), 
    [
      Delimiter  = ",", 
      Columns    = 7, 
      Encoding   = 65001, 
      QuoteStyle = QuoteStyle.None
    ]
  ), 
  #"Promoted Headers"
    = Table.PromoteHeaders(
    Source, 
    [PromoteAllScalars = true]
  ), 
  #"Changed Type"
    = Table.TransformColumnTypes(
    #"Promoted Headers", 
    {
      {"A", Int64.Type}, 
      {"B", Int64.Type}, 
      {"C", Int64.Type}, 
      {"D", Int64.Type}, 
      {"E", Int64.Type}, 
      {"F", Int64.Type}, 
      {"G", Int64.Type}
    }
  ), 
  BufferTable = Table.Buffer(
    #"Changed Type", 
    [BufferMode = BufferMode.Delayed]
  )
in
  BufferTable

Then, when I run my refresh, the “Evaluating…” message only appears very briefly before the data starts to load:

It’s important to stress that after the “Evaluating…” phase the data load takes exactly the same amount of time – it’s only the “Evaluating…” phase that is faster. This can save you a lot of time as a developer, nonetheless. I have been told when these options are available in dataflows they will make validation (which occurs when you close Power Query Online after editing a dataflow) much faster too – in fact this developed to partially solve the dataflow validation problem.

[Thanks to Curt Hagenlocher for this information]

Update: something I should make clear is that this functionality is only useful for people who are already using Table.Buffer in their queries. If you’re not using Table.Buffer already then these changes won’t be of any benefit or interest.

Refreshing Excel Power Query Queries With VBA And Power Automate For Desktop

At the end of last year two new, fairly minor, improvements were made to Excel VBA’s support for Power Query: you can now refresh and delete individual Power Query queries. These methods are available now for everyone using the Insiders build of Excel. Not particularly interesting on its own (I wrote a post a few years ago about Excel VBA and Power Query if you’re interested) but it got me thinking about this subject again. Also, at the end of last year, I upgraded to Windows 11 which has Power Automate for desktop built in. Power Automate desktop makes it super easy to automate everyday tasks and it has great support for Excel – including the ability to run Excel VBA macros. So I wondered: can you use Power Automate for desktop to automatically refresh your Excel Power Query queries? Yes, you can!

Here’s a simple example. I created a Power Query query called GetTheDateAndTime that returns the current date and time in a table with one row and column. Here’s the M code for the query:

#table(type table [RunDate=datetime], {{DateTime.FixedLocalNow()}})

Next, I created a VBA macro called RunPQQuery to refresh this Power Query query using the new refresh method I mentioned earlier:

Sub RunPQQuery()
    ActiveWorkbook.Queries("GetTheDateAndTime").Refresh
End Sub

I then saved the Excel workbook as a .xlsm file.

Next I opened Power Automate for desktop and created a new desktop flow following the instructions here to open Excel, run the macro and close Excel again. I realised that if I closed Excel immediately after running the macro it would close Excel before the Power Query query had finished, so I added a delay of ten seconds after running the macro to give it time to finish. There are probably more sophisticated ways to solve this problem: for example you could read the value of a cell in the table returned by the query that you knew would change, then after running the query loop until the value you’ve read has changed. Here’s my desktop flow:

Finally I created a cloud flow to run this desktop flow:

And that’s it! A very simple example but very easy to implement.

Speed Up Power BI Refresh By Increasing The Amount Of Memory On Your On-Premises Data Gateway Machine

If your Power BI dataset needs to connect to an on-premises data source it will need to connect via an On-Premises Data Gateway; what’s more, if you have a Power Query query that combines data from cloud and on-premises data sources, then Power BI needs to connect to all data sources used (even if they are cloud sources) via an On-Premises Data Gateway. And when Power BI connects to a data source via a gateway all the transformation work done by the Power Query engine takes place on the machine where the gateway is installed.

As a result of all this the specification of the machine where the gateway is installed has an impact on the performance of any dataset refreshes that use it. So how powerful does the machine with the gateway installed on it need to be? That’s a tough question because, as you can probably guess, it depends on a lot of different factors: how many datasets get refreshed in parallel, how often, how complex the transformations used are, if you’re using Import mode or DirectQuery, and so on. There’s a great docs article describing how to go about sizing your gateway machine here. Unsurprisingly, the more memory and CPU cores you have available the better refresh performance is likely to be and the more refreshes can take place in parallel.

There is one important thing to point out that is not obvious though: increasing the amount of memory on your gateway machine can improve refresh performance even if it doesn’t look like the machine is under memory or CPU pressure. This is because the total amount of memory made available for a single refresh is calculated relative to the overall amount of memory available on the gateway machine. I’ve written about how the Power Query engine uses memory a few times: this post describes how each refresh can use a fixed maximum amount of memory and how performance suffers if your refresh needs to use more; this post shows how increasing the amount of memory Power Query can use for a single refresh can increase refresh performance dramatically. In short, the more memory on your gateway machine the more memory is available for each individual refresh and – if the Power Query engine needs it, for example if you’re sorting, merging, pivoting/unpivoting, buffering or doing group-bys on large tables and query folding is not taking place – the faster each refresh will be.

You do have the option of changing some properties (listed here) on the gateway to try to influence this behaviour. However since the algorithms involved are not documented and may change at any time, not all the relevant properties are documented, and working out what the optimal settings are yourself is very difficult, I don’t recommend doing this. It’s a lot easier just to increase the amount of memory and CPU on the gateway machine and let the gateway work out how these resources should be used. I’m not saying that you should blindly increase your memory and CPU as much as you can, though – you should test to see what the impact on refresh performance is (the gateway logs will be useful here) and whether that impact is worth the extra cost.

Bonus tip: another easy way to improve gateway refresh performance is to enable the StreamBeforeRequestCompletes property on the gateway. It’s documented here and a few people (see here and here for example) have already blogged about how much this has helped them.

Expanding Azure Data Explorer Dynamic Columns In Power Query

Azure Data Explorer has a data type called dynamic which can be used to hold scalar values as well as arrays and property bags; you can read about it in the docs here. For example (following on from my recent series on DirectQuery on Log Analytics, starting here) the ApplicationContext column in the PowerBIDatasetsWorkspace table that holds the IDs of the dataset, report and visual that generated a DAX query (see this post for more background) is of type dynamic:

This is what the contents of the column look like:

Now you can easily extract the individual property values from this column in KQL, and indeed I did so in the KQL queries in this post, but the interesting thing is you can also extract these values in Power Query M very easily and – crucially – maintain query folding using the Record.FieldOrDefault M function in a custom column without needing to write any KQL yourself, in both Import mode and DirectQuery mode.

In order to do this, first of all you have to enter a table name or KQL query in the third parameter of AzureDataExplorer.Contents function. When you do this you can treat a dynamic column as a record even if the Power Query UI doesn’t display it as such. Here’s an example M query that shows all of this in action on the PowerBIDatasetsWorkspace table that contains Power BI data in Log Analytics:

let
  Source = AzureDataExplorer.Contents(
    "InsertClusterName", 
    "InsertDBName", 
    "PowerBIDatasetsWorkspace", 
    [
      MaxRows                 = null, 
      MaxSize                 = null, 
      NoTruncate              = null, 
      AdditionalSetStatements = null
    ]
  ), 
  #"Added Custom" = Table.AddColumn(
    Source, 
    "Custom", 
    each Record.FieldOrDefault(
      [ApplicationContext], 
      "DatasetId"
    )
  )
in
  #"Added Custom"

From this query, here’s what the Custom Column dialog for the #”Added Custom” step looks like:

Here’s the output in the Power Query Editor:

And here’s the KQL query that this M query folds to (taken from the View Native Query dialog in the Power Query Editor):

PowerBIDatasetsWorkspace
| extend ["Custom"]=["ApplicationContext"]["DatasetId"]

I admit that this is a super-obscure tip but I think it’s fascinating nonetheless, especially given how nested structures are becoming more and more common in the world of big data. It would be great to have similar behaviour in other connectors…

Thanks to my colleague Itay Sagui (whose blog has several posts on Power BI/Azure Data Explorer integration) for this information.

Simulating Slow Data Sources In Power BI

As a postscript to my series on Power BI refresh timeouts (see part 1, part 2 and part 3) I thought it would be useful to document how I was able to simulate a slow data source in Power BI without using large data volumes or deliberately complex M code.

It’s relatively easy to create an M query that returns a table of data after a given delay. For example, this query returns a table with one column and one row after one hour and ten minutes:

let
  Source = Function.InvokeAfter(
    () => #table({"A"}, {{1}}), 
    #duration(0, 1, 10, 0)
  )
in
  Source

Some notes:

I’m using #table to return the table without having to connect to a data source. More details on how to use #table can be found here.
The delay is achieved using the Function.InvokeAfter M function, with the amount of time to wait for specified using #duration

A more interesting problem is how to create an M query that, instead of waiting for a given duration and then returning a table immediately, returns the rows of a table one at a time with a delay between each row. Here’s a query that does that, returning ten rows one second at a time:

let
  NumberOfRows = 10,
  DaysToWait = 0,
  HoursToWait = 0,
  MinutesToWait = 0,
  SecondsToWait = 1,
  Source = #table(
    {"A"},
    List.Transform(
      {1 .. NumberOfRows},
      each Function.InvokeAfter(
        () => {1},
        #duration(
          DaysToWait,
          HoursToWait,
          MinutesToWait,
          SecondsToWait
        )
      )
    )
  )
in
  Source

Last of all, to simulate a slow SQL Server data source – not being much good at TSQL at all – I borrowed some code from this thread on Stack Overflow to create a function that returns a scalar value after a specified number of seconds:

CREATE FUNCTION [dbo].[ForceDelay](@seconds int) returns int as 
BEGIN DECLARE @endTime datetime2(0) = DATEADD(SECOND, @seconds, GETDATE()); 
WHILE (GETDATE() < @endTime ) 
SET @endTime = @endTime;
return 1;
END

I showed how to call this function from Power Query using a native SQL query here.

Setting SQL Server CONTEXT_INFO In Power Query

In my quest to check out every last bit of obscure Power Query functionality, this week I looked into the ContextInfo option on the Sql.Database and Sql.Databases M functions. This option allows you to set CONTEXT_INFO in SQL Server (see here for a good article explaining what this is) and here’s an example of how to use it:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017", 
    [
      Query = "SELECT * FROM DIMDATE", 
      ContextInfo = Text.ToBinary(
        "Hello"
      )
    ]
  )
in
  Source

This Power Query query runs a simple SQL SELECT statement against the SQL Server Adventure Works DW 2017 database. Note that since you need to pass a binary value to the ContextInfo option, in this example I had to use the Text.ToBinary function to convert my text to binary.

Here’s what happens in SQL Server when this Power Query query is run:

Here’s a simple example of how to retrieve this data on the SQL Server side:

SELECT session_id, login_time, program_name, context_info 
FROM sys.dm_exec_sessions
WHERE session_id=57

I’ll leave it to the SQL Server experts to decide what this can be used for and no doubt to complain that it would be more useful to support SESSION_CONTEXT too – although I’ve heard that might already be used for something, so I need to do more research here…

Power BI/Power Query Data Privacy Settings And Errors Caused By Nested Values

Over the past few years I’ve blogged and presented extensively on the subject of Power Query’s data privacy settings (see here for a post with links to all this content). I thought I knew everything there was to know… but of course I didn’t, and I’ve recently learned about an issue that can cause mysterious errors.

As always it’s easiest to show an example of how it can occur. Here’s a table of airport names taken from the TripPin public OData feed:

Note how the Location column contains nested values of data type Record, and note that I have not expanded this column.

Here’s another query with sales data for these airports coming from Excel:

Now let’s say we want to join these two queries together using a Merge operation in the Power Query Editor. When you do this, because you are combining data from two different sources and because OData supports query folding, you will be prompted to set data privacy settings on these sources (unless you have already done so at some point in the past) because a Merge could result in data being sent from Excel to the OData source.

If you set the data privacy levels to Private on each source, like so:

…you are telling Power Query that it should never send data from these sources to any other source. As a result, Power Query has to load the data from both sources, buffer that data in memory, and do the Merge inside its own engine.

When you do the Merge everything looks normal at first:

But when you expand the Airports column you’ll see that the nested values in the Location column have been turned into the text “[Record]” and as a result can no longer be expanded.

This is because Power Query has had to buffer the values in the Airports query but it is unable to buffer nested values (I wrote about this here).

There are two ways to fix this. First of all, you can change the data privacy settings or turn them off completely. I don’t recommend turning them off completely because this is only possible in Power BI Desktop and not in the Power BI Service, and even changing the data privacy settings can lead to some unexpected issues later on. For example, if you set the data privacy levels for both sources to Public like so:

…then no buffering is needed (because data can now be sent from one source to another) and the nested values in the Location field can be expanded:

…and of course you do so:

However, people always forget that you have to set your data privacy levels again after publishing your dataset to the Power BI Service. And if you or someone else subsequently sets the data privacy levels back to Private you’ll get the following error in the output of the query:

The error message here is:

“Expression.Error: We cannot convert the value “[Record]” to type Record.”

Depending on the data type of the nested field you might get:

“Expression.Error: We cannot convert the value “[Table]” to type Table.”

“Expression.Error: We cannot convert the value “[List]” to type List.”

…instead.

The second way to fix the problem is easier and probably safer: you just need to expand the Location column before the Merge operation instead of after it. That way there are no nested fields present when the Merge takes place so all the values can be buffered. Here’s what the Airports table looks like after the Location column has been expanded, before the Merge:

…and here’s the output of the Merge even when the data privacy levels for both sources are set to Private:

Multi-Value Parameters In Power Query Online

There’s a nice new feature in Power Query Online (the version of Power Query used in Dataflows): parameters of type List. You can see this as a new option in the Type dropdown in the “Manage parameters” dialog:

Why is this interesting? In the past, Power Query parameters were always single values like a date or a string; now a parameter can contain mutliple values.

There’s one other new feature in Power Query Online that goes along with this: In and Not In filters, which can use these new List parameters.

I’m sure there are other cool things you can do with this but I’ll leave them to future blog posts.