Calculate(), Filter() And DAX Memory Usage

Following on from my last post on the Query Memory Limit in Power BI, in this post I want to look at one of the most common DAX antipatterns and its effect on query memory usage: filtering on a whole table, rather than an individual column, in the filter parameters of the Calculate() function. A lot has already been written on this particular antipattern from the point of view of query performance – for example see here and here – but it’s only with the recent addition of the Execution Metrics event in Profiler and Log Analytics that you can see how bad it is for memory usage too.

Here’s a simple example. Using a semantic model based on the UK Land Registry’s Price Paid data, with two dimension tables (Date and Property Type) and a fact table called Property Transactions with just over a million rows:

…let’s say that you write a measure that counts the number of property transactions for detached properties, ie counting the rows in the Property Transactions table where the Property Type column contains the value “D”. Here’s a first attempt at that:

Detached Count V1 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    FILTER ( 'Property Transactions', 'Property Transactions'[Property Type] = "D" )
)

Notice that this measure follows the antipattern of filtering on the whole Property Transactions table with the Filter() function just to filter on the Property Type column.

Here’s a DAX query that shows the results of this measure for all Postcodes. It returns around 153,000 rows, which is enough to make the memory impact of the way the measure is written visible.

EVALUATE
SUMMARIZECOLUMNS(
    'Property Transactions'[Postcode],
    "Detached Count", [Detached Count V1]
)

Here’s what the Execution Metrics event returns for this query:

{
	"timeStart": "2024-06-28T15:41:59.951Z",
	"timeEnd": "2024-06-28T15:42:00.388Z",

	"durationMs": 438,
	"vertipaqJobCpuTimeMs": 47,
	"queryProcessingCpuTimeMs": 391,
	"totalCpuTimeMs": 438,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 22246,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 152583
}

In this case the memory usage of the query, as given by the approximatePeakMemConsumptionKB metric, is 22246KB or about 22MB.

Now consider this version of the measure:

Detached Count V2 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    'Property Transactions'[Property Type] = "D"
)

Running the same DAX query as before, but with this new measure, you get the same results with a much lower memory usage of 8734KB or about 9MB:

{
	"timeStart": "2024-06-28T15:56:09.431Z",
	"timeEnd": "2024-06-28T15:56:09.602Z",

	"durationMs": 172,
	"vertipaqJobCpuTimeMs": 31,
	"queryProcessingCpuTimeMs": 125,
	"totalCpuTimeMs": 172,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 8734,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 152583
}

Query duration and CPU usage are reduced too but, as I said above, this highlights how different DAX patterns can result in very different memory footprints. Filtering on just the column you want, rather than the whole table, is a lot more memory efficient! If you’re getting the “This visual has exceeded the available resources” error because you’ve hit the Query Memory Limit then it’s worth checking to see if you have used this antipattern in your measures and rewriting accordingly.
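As a quick sanity check, the two Execution Metrics payloads above can be compared with a few lines of Python (using 1MB ≈ 1000KB, in line with the rounding used in this post):

```python
import json

# The approximatePeakMemConsumptionKB values from the two payloads above
v1 = json.loads('{"approximatePeakMemConsumptionKB": 22246}')
v2 = json.loads('{"approximatePeakMemConsumptionKB": 8734}')

def peak_mb(event: dict) -> float:
    """Convert the approximate peak memory metric from KB to MB."""
    return event["approximatePeakMemConsumptionKB"] / 1000

saving = 1 - v2["approximatePeakMemConsumptionKB"] / v1["approximatePeakMemConsumptionKB"]
print(f"V1 peak: {peak_mb(v1):.1f}MB, V2 peak: {peak_mb(v2):.1f}MB")
print(f"Memory saving from the rewrite: {saving:.0%}")  # roughly 61%
```

The absolute numbers here are small, but the relative saving is what matters: the same pattern on a larger model can be the difference between a query that runs and one that hits the Query Memory Limit.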

Footnote: it’s worth mentioning that two other variations on the rewritten measure, one using the Property Type column on the Property Type dimension table instead of the Property Type column on the fact table:

Detached Count V3 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    'Property Type'[Property Type] = "D"
)

…and one adding the KeepFilters() function (which is commonly used in this type of measure):

Detached Count V4 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    KEEPFILTERS ( 'Property Type'[Property Type] = "D" )
)

…both had the same memory usage as the V2 version of the measure, 9MB.

Power BI Semantic Model Memory Errors, Part 4: The Query Memory Limit

Continuing my series on Power BI memory errors (see part 1, part 2 and part 3), in this post I’ll look at the query memory limit, which controls the amount of memory that an individual query can consume. This is a subject I’ve talked about in a few blog posts in the past: hitting this limit is one of the reasons you’ll see the “This visual has exceeded the available resources” error which I blogged about here and here. There’s also some official documentation here which is pretty good.

As I mentioned in part 1 of this series, when you run a DAX query against a Power BI semantic model it will use memory. How much memory it uses will depend on your data, how you have modelled it, what data your query is requesting and how you have written any DAX calculations used by your query. For obvious reasons we at Microsoft can’t let queries use an infinite amount of memory in the Power BI Service so there is a limit – the query memory limit – on how much memory an individual query can use. This limit varies depending on whether you are using Shared (also known as Power BI Pro) or Premium/Fabric capacity, and by the size/SKU of the capacity you’re using. The different limits for each SKU are documented here.

While you can’t increase the query memory limit over the published thresholds, capacity admins can reduce the query memory limit in the admin portal by setting the “Query Memory Limit %” property on the capacity.

The default value of this property is 0, which means the maximum value for the capacity SKU you’re using will be applied; any other value will be interpreted as a percentage of that maximum value. Reducing this value can be a good way to stop your developers from using inefficient DAX in their reports and consuming too many resources on your capacity – it won’t directly reduce a query’s CU usage but DAX expressions that are very memory-intensive often use a lot of CUs too.
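To make that behaviour concrete, here’s a minimal Python sketch of how the percentage is applied; the 10GB figure is purely illustrative and is not the documented limit for any particular SKU:

```python
def effective_query_memory_limit_gb(sku_max_gb: float, query_memory_limit_pct: int) -> float:
    """Work out the per-query memory limit that will actually be enforced.

    A Query Memory Limit % of 0 (the default) means the documented maximum
    for the capacity SKU applies; any other value is a percentage of it.
    """
    if query_memory_limit_pct == 0:
        return sku_max_gb
    return sku_max_gb * query_memory_limit_pct / 100

print(effective_query_memory_limit_gb(10, 0))   # default: the full 10GB applies
print(effective_query_memory_limit_gb(10, 50))  # capped at 5.0GB
```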

If the DAX query generated by a visual in a Power BI report hits the query memory limit you’ll see an error in your report. The first reaction many Power BI developers have when they hit the limit is to ask if there is a way to increase it – which you can only do by buying a larger capacity – but to be brutally honest, if you are hitting the query memory limit you have done something wrong and you need to fix your model, your DAX calculations or your report to reduce memory usage. Doing this is likely to improve report performance too. If you’re encountering the error in Power BI Desktop you can raise the limit but this is only possible in Desktop and if you do this you may end up getting the error again after you publish.

Here’s the error you’ll see in Power BI Desktop if you hit the query memory limit:

This visual has exceeded the available resources. Try filtering to decrease the amount of data displayed.

If you hit the query memory limit in a published report you’ll see this error message:

In this case you not only get the “Visual has exceeded the available resources” error but going to the details shows a more helpful message that tells you what the current query memory limit is:

Resource Governing: This query uses more memory than the configured limit. The query — or calculations referenced by it — might be too memory-intensive to run. Either simplify the query or its calculations, or if using Power BI Premium, you may reach out to your capacity administrator to see if they can increase the per-query memory limit. More details: consumed memory 1 MB, memory limit 1 MB. See https://go.microsoft.com/fwlink/?linkid=2159752 to learn more.

The error number associated with this error is 0xC13E0004 (-1052901372).

How do you know how much memory your query is using, to check how near you are to the query memory limit? The new Execution Metrics event in Profiler and Log Analytics will tell you. If you find the Execution Metrics event associated with the Query End event for the query (something I discussed here), the approximatePeakMemConsumptionKB metric gives the peak memory usage reached while the query was running, and it is this value that must not exceed the Query Memory Limit.

{
	"timeStart": "2024-06-21T16:24:16.608Z",
	"timeEnd": "2024-06-21T16:24:18.326Z",

	"durationMs": 1719,
	"vertipaqJobCpuTimeMs": 1078,
	"queryProcessingCpuTimeMs": 531,
	"totalCpuTimeMs": 1609,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 46662,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 502
}
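If you’re capturing these events programmatically, checking how close a query came to the limit is straightforward. In this Python sketch the 10GB limit is a hypothetical value standing in for whatever your SKU’s documented limit actually is:

```python
import json

# An Execution Metrics payload of the same shape as the one shown above
event = json.loads('{"approximatePeakMemConsumptionKB": 46662, "queryResultRows": 502}')

QUERY_MEMORY_LIMIT_KB = 10 * 1024 * 1024  # hypothetical 10GB limit, expressed in KB

peak_kb = event["approximatePeakMemConsumptionKB"]
print(f"Peak memory: {peak_kb}KB, {peak_kb / QUERY_MEMORY_LIMIT_KB:.2%} of the limit")
```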

If you can’t use Profiler or Log Analytics there’s no direct way of knowing how much memory a query is using. You can, however, set a custom memory limit in Power BI Desktop and keep changing it until you see, or don’t see, the query memory limit error, and in that way estimate how much memory your query is using.

How can you reduce the amount of memory that your queries use? That’s a question that’s too big to answer in any one blog post and this one is already long enough. In my next few posts I will look at some common causes of excessive memory usage and what you can do to fix them.

The “DataFormat.Error: File contains corrupted data” Error In Power Query

When you’re using Excel workbooks as a data source in Power Query in either Excel or Power BI you may encounter the following error when trying to connect:

DataFormat.Error: File contains corrupted data.

There are three possible causes of this error that I know of, two of which are well documented. First, the file may actually be corrupt in which case I hope you have a backup! Second, the Excel file may be password protected. Power Query cannot connect to a password protected Excel file so you’ll need to remove the password protection before you can use it as a source.

The third is one I ran into this week and it confused me a lot. It turns out that if the Excel file has a sensitivity label applied to it that results in it being encrypted, Power Query in Excel will not be able to connect to it – although Power Query in Power BI will be able to. When connecting to an encrypted Excel file, Power Query in Power BI is able to decrypt the file using the credentials of the logged-in user and read it, but (at least at the time of writing) Power Query in Excel cannot do this and so you get the error above. In my case I had an Excel workbook that used Power Query to connect to a few other Excel workbooks, load data from them, and do some tax calculations. It was working a few months ago but when I reopened it this week I got the “DataFormat.Error” error and I couldn’t work out why. It turns out that in the meantime an admin at Microsoft had applied a company-wide policy that meant all workbooks stored in OneDrive for Business had a highly restrictive sensitivity label applied automatically – which means my Power Query queries stopped working. As soon as I changed the sensitivity label on my source workbooks to “Non Business” so they weren’t encrypted, everything worked again.

[Thanks to Curt Hagenlocher for this information]

Get Better Results From Power BI Copilot With Linguistic Modelling

Everyone is excited about Power BI Copilot, and the newly-announced preview of being able to use Copilot to ask questions about all the data in your semantic model rather than just what is shown in a report is a massive step forward. However amazing the LLMs used behind the scenes are, though, the quality of the results your users get from this new Copilot preview depends on a number of factors that you as a developer control. How you model your data is very important and as the announcement blog mentions, so is the linguistic schema that was originally added for Copilot’s predecessor feature, Q&A. Copilot returns much better results than Q&A ever did but the combination of Copilot and the extra information that Q&A’s linguistic schema provides (information Copilot could not know unless you told it) makes Copilot even more powerful. What’s more, you don’t need to edit a YAML file to use this functionality any more because most of the features of the linguistic schema are now available to edit in Power BI Desktop’s Q&A Setup dialog.

In this blog post I’ll show you a few examples of how adding to a model’s linguistic schema improves the new Power BI Copilot preview’s results when you’re querying your semantic model.

Semantic Model

Let’s say you own a farm where you grow fruit. Customers visit the farm to buy fruit and the fruit is picked for these customers by your employees. You store your sales data in a Power BI semantic model that looks like this:

The Orders table is a fact table with one row for each order line. The Order ID column identifies the order that each line item is linked to, the Amount column contains the sales value and the Units column contains the number of fruit delivered for each line. The Employee dimension gives the name of the employee who picked the fruit; the Customer dimension gives the name of the customer who ordered the fruit; the Product dimension gives the name of the fruit picked. Products are associated with Product Groups via a many-to-many relationship.

Here’s the data in the dimension tables:

To get the best results from Copilot note that:

  • The data is modelled as a classic star schema.
  • The table and column names are in human-readable English with no abbreviations and with spaces between the words. I talked about my opinions on Power BI naming conventions in this blog post.
  • All tables and columns that should not be shown in a report have been hidden.
  • The fact table measure columns have been hidden and three explicit measures – Order Amount, Order Units and Order Count (which is a distinct count on the Order ID column) – have been created.

Synonyms

While Copilot performs well on this model, let’s look at a simple question where it doesn’t return the expected result:

Show the number of orders by employee

The prompt returns a visual, but on closer inspection it’s not the result you want. It shows the count of rows in the Orders table which is the number of line items, not a count of orders:

To get the correct result you need to tell Copilot that the Order Count measure returns the number of orders by defining a Synonym. You can do this in the Q&A setup dialog in Power BI Desktop on the Synonyms tab:

Setting “number of orders” as a synonym for the Order Count measure means that the prompt now returns the following visual with the results you want:

Verbs

The next prompt to look at is:

Who picked lemons?

You know that on our farm it’s the employees who pick the fruit but there’s nothing in the model to tell Copilot that. As a result the prompt above results in Copilot saying that it doesn’t know what “picked” means in this context:

On the relationships tab of the Q&A Setup dialog you can fix this by defining a Verb relationship:

The relationship tells Copilot that “Employee names” pick “Product names” with the Orders table connecting the two columns.

With this relationship in place, Copilot correctly answers that Gabi was the only employee who picked lemons:

Nouns

The customer Chris is also widely referred to by employees as “Mr Webb”, but that name isn’t stored anywhere in the model. As a result the prompt

How much money did we make from Mr Webb?

results in the following, fairly reasonable, response:

However with a noun relationship set up to tell Copilot that “Mr Webb” is a kind of customer name where customer name equals “Chris”:

Then the result is what you would expect:

Dynamic Nouns

Copilot does a good job with the many-to-many relationship between Product and Product Group without any optimisation. For example the prompt:

show all citrus fruit and their order amounts

…returns the correct result:

But let’s say that in this case you want to show the individual products rather than the product group “citrus fruit”. You can do this by setting up a dynamic noun relationship:

The relationship is that “Product group names” define kinds of “product names”, with the Product To Product Group table linking the two. With this in place the prompt now returns the desired result:

Conclusion

These examples barely scratch the surface of what’s possible with the linguistic schema and Copilot. Apart from the documentation, I also found the videos on the (fairly old) “Natural Language for Power BI” YouTube channel, created when Q&A was launched, useful for understanding the concepts here. There’s a lot to learn but, with some trial and error as well as listening to feedback from your end users, you should be able to tune Copilot so it returns high quality results almost all the time.

Power BI Semantic Model Memory Errors, Part 3: The Command Memory Limit

Continuing my series on Power BI model memory errors (see part 1 and part 2), in this post I will look at the Command Memory Limit which restricts the amount of memory that XMLA commands like Create, Alter and most importantly Refresh can use.

If you’ve ever been told that your semantic model should consume less than half the amount of memory available to it because memory consumption can double during a full refresh, then that is because of the Command Memory Limit. Every time a model is refreshed in Power BI, that refresh is initiated by running a Refresh command. During the refresh a copy of the model is created in the background and it is the copy that is refreshed; when the refresh is completed, Power BI deletes the original version of the model and replaces it with the copy. While the refresh is in progress, the memory consumed by this copy of the model and all the operations needed to load data into it (including any Power Query queries used to get data from your data sources and to transform that data) is associated with the Refresh command. The Command Memory Limit specifies how much memory the Refresh command is allowed to use.

The good news is that there is an excellent, detailed explanation of the Command Memory Limit in the docs here which I recommend you read before continuing:

https://learn.microsoft.com/en-gb/power-bi/enterprise/troubleshoot-xmla-endpoint#resource-governing-command-memory-limit-in-premium

What it says is that the amount of memory that XMLA commands like Refresh can use is the maximum allowed size for a semantic model for the capacity you’re using (as documented in the table here in the Max Memory column) minus the amount of memory the semantic model is using when the command starts.
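In other words the limit isn’t a fixed number: it shrinks as your model grows. Here’s a trivial Python sketch of the formula, using the F16 numbers that appear later in this post (5GB maximum model memory, 3.3GB model):

```python
def command_memory_limit_gb(max_model_memory_gb: float, model_size_gb: float) -> float:
    """Memory available to an XMLA command: the SKU's maximum allowed model
    memory minus the memory the model is using when the command starts."""
    return max_model_memory_gb - model_size_gb

# A 3.3GB model on a capacity with a 5GB maximum leaves ~1.7GB for a Refresh
print(f"{command_memory_limit_gb(5, 3.3):.1f}GB")
```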

Let’s look at an example of an error caused by exceeding the Command Memory Limit.

I created an Import mode semantic model whose total size was 3.3GB, and which consisted of a single partitioned table with 20 columns, each of which contained random decimal numbers. I refreshed this model on an F64 capacity where the maximum allowed memory per model is 25GB and the refresh succeeded.

How much memory did the Refresh command use? There are two ways to find out. For a while now you have been able to get the approximate peak memory usage during a refresh (along with the approximate peak memory usage just for Power Query queries, which is a subset of this figure) from the Command End event associated with that refresh in Profiler and Log Analytics; I blogged about this here. However the new Execution Metrics event in Profiler and Log Analytics makes the same information even easier to extract. Here’s what the Execution Metrics event for the refresh looked like in this case:

{
	"timeStart": "2024-05-26T17:18:29.984Z",
	"timeEnd": "2024-05-26T17:26:06.577Z",

	"durationMs": 456593,
	"directQueryExecutionTimeMs": 6068,
	"vertipaqJobCpuTimeMs": 448938,
	"mEngineCpuTimeMs": 362094,
	"totalCpuTimeMs": 1565781,
	"executionDelayMs": 370,

	"approximatePeakMemConsumptionKB": 6074967,
	"mEnginePeakMemoryKB": 530688,

	"tabularConnectionTimeoutMs": 18000000,

	"commandType": 2,
	"refreshParallelism": 6,
	"vertipaqTotalRows": 15000000,
	"qsoReplicaVersion": 133612179663049757,
	"intendedUsage": 2
}

[Note: if you’re looking in Profiler or Log Analytics you’ll see a lot of Execution Metrics events. Typically the Execution Metrics events for a refresh will be generated immediately after the Command End event for that refresh, but to be sure you should look for matching values in the RequestId column (in a Profiler trace) or the XmlaRequestId column (in Log Analytics) to associate an Execution Metrics event with a Command End event.]

The metric to look at above is approximatePeakMemConsumptionKB and it shows the refresh used about 6,074,967KB or 5.8GB at its peak – a lot more than double the size of the model before or after the refresh. I designed the model specifically for this to happen (20 columns of random decimal numbers is not easy to compress) and in most cases the memory usage will be lower relative to the size of the model.

I then scaled the capacity down to an F16 which only has a memory limit of 5GB for semantic models and refreshed again. As you would expect, the refresh failed with the following error:

Resource Governing: This operation was canceled because there wasn’t enough memory to finish running it. Either reduce the memory footprint of your dataset by doing things such as limiting the amount of imported data, or if using Power BI Premium, increase the memory of the Premium capacity where this dataset is hosted. More details: consumed memory 1780 MB, memory limit 1779 MB, database size before command execution 3340 MB. See https://go.microsoft.com/fwlink/?linkid=2159753 to learn more. The current operation was cancelled because another operation in the transaction failed.

This is the error message associated with hitting the Command Memory Limit (the associated error number is -1052901373). What this message is saying is that the model was consuming 3340MB (3.3GB) before the refresh started; the refresh itself then reached a maximum of 1780MB (1.7GB), at which point it was cancelled, because 5GB - 3.3GB = 1.7GB was all the memory the Refresh command was allowed to use. Here’s the Execution Metrics data for the refresh command:

{
	"timeStart": "2024-05-27T10:11:53.985Z",
	"timeEnd": "2024-05-27T10:12:56.834Z",

	"durationMs": 62849,
	"directQueryExecutionTimeMs": 3167,
	"mEngineCpuTimeMs": 29813,
	"totalCpuTimeMs": 80375,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 1821523,
	"mEnginePeakMemoryKB": 518648,

	"tabularConnectionTimeoutMs": 18000000,

	"commandType": 2,
	"errorCount": 2,
	"refreshParallelism": 4,
	"vertipaqTotalRows": 933137,
	"intendedUsage": 2
}
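The arithmetic in the error message above checks out too (taking 1GB = 1024MB):

```python
f16_max_model_memory_mb = 5 * 1024   # the F16's 5GB maximum, expressed in MB
model_size_before_mb = 3340          # "database size before command execution"

command_limit_mb = f16_max_model_memory_mb - model_size_before_mb
print(command_limit_mb)  # 1780, in line with the ~1779MB limit in the message
```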

So is it even going to be possible to refresh this model on an F16 capacity, given that refreshing it requires 5.8GB? Well, the most memory-efficient way of refreshing a semantic model without changing it is to clear all the data out of it first, then refresh the data in each partition in each table one at a time, then do a recalc, all in separate transactions – which is exactly what the partialBatch commit mode of the Enhanced Refresh API (which I blogged about recently here) does if you set its max_parallelism property to 1. So I tried this – and it failed again ☹️. Unsurprisingly the first refresh operation, of type ClearValues, ran successfully and peaked at 0.01GB of memory, after which the size of the model in memory would have been negligible. The refresh of type DataOnly for the first partition in the table was then successful, but the operation peaked at 3.4GB. The second partition refresh of type DataOnly then failed; I assume it would also have peaked at 3.4GB, but the error message told me that the model was already 2.4GB in size when the refresh started, so the memory limit for this refresh was 2.6GB.
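For reference, here’s a Python sketch of the kind of request body this approach uses, with the partialBatch commit mode and no parallelism. The workspace and dataset IDs are placeholders and authentication is omitted; treat the exact parameter names as something to verify against the Enhanced Refresh API documentation:

```python
import json

# A refresh request using the partialBatch commit mode with no parallelism.
# POST this (with a bearer token) to:
# https://api.powerbi.com/v1.0/myorg/groups/{workspaceId}/datasets/{datasetId}/refreshes
refresh_request = {
    "type": "full",
    "commitMode": "partialBatch",  # clear, then per-partition data loads, then recalc
    "maxParallelism": 1,           # refresh one partition at a time
}

print(json.dumps(refresh_request, indent=2))
```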

If you’re hitting the Command Memory Limit and you don’t want to increase the size of your capacity, but you are willing to make changes to your model, then there could be several ways to reduce the amount of memory used during a refresh – but in order to understand which method will work you need to know what is using memory during the refresh, and that isn’t easy. Using the partialBatch commit option with the Enhanced Refresh API does at least split the refresh out into its constituent parts and makes it easier to see at which stage the refresh fails. In this case the refresh failed at the DataOnly stage so creating more, smaller partitions could help. Following the steps for reducing model size given in blog posts like this one by Nikola Ilic will also do a lot to reduce memory consumption during a refresh. Other common culprits for high memory consumption during a refresh include calculated columns and tables, or Power Query queries that buffer large amounts of data in memory (for example because of transformations that sort or aggregate data and which do not fold). Therefore following Roche’s Maxim – doing all your transformations, and creating the equivalent of your calculated columns and tables, in your data source before the data is loaded into Power BI – will help. If you do have to use calculated columns and tables then you should look at tuning your DAX expressions so they use less memory. I’ll look at some examples of how to tune your Power Query queries or DAX expressions to reduce memory usage in future posts.