Power BI/AI Book Roundup

Here’s another one of my occasional posts about books I’ve been sent free copies of. Full disclosure: as always, these aren’t reviews as such, they’re more like free publicity in return for the free books and I don’t pretend to be unbiased; also the Amazon UK links have an affiliate code in them that gives me a kickback if you buy any of these books.

The AI Value Playbook, Lisa Weaver-Lambert

What am I doing covering an AI book here? Lisa is an ex-colleague of mine at Microsoft and I respect her opinions. Also, I suspect that, like a lot of you, I have mixed feelings about the current AI boom: I can see the value in AI but I can also see the vast amount of hype and the obviously ridiculous claims being made. More than anything I see senior executives talking confidently about a subject I’m sure they don’t understand, and that is clearly a big problem. This book aims to help solve that problem by providing a practical guide to AI for non-technical leaders, in the form of a series of case studies and interviews with entrepreneurs and C-level people in the AI space. This is a very readable book – Lisa has talked to a lot of interesting, knowledgeable people – and the format makes it a lot more palatable for the target audience of your boss’s boss’s boss than your average tech book. As a technical person who isn’t by any means an AI expert, I also enjoyed reading it.

The Complete Power BI Interview Guide, Sandielly Ortega Polanco, Gogula Aryalingam and Abu Bakar Nisa Alvi

Spend any time on public Power BI forums and you’ll see a lot of questions from people who want to know how to start a career in Power BI or get tips for Power BI interviews; as a result I’m sure there’s a big market for a book like this. It’s a mix of technical topics (the type that you might be asked about in a technical interview for a Power BI job) and non-technical advice such as how to network on LinkedIn, negotiate salaries and accept or reject a job offer. That might seem a bit of a strange combination but it works, and the advice is both detailed and very sensible, so I would have no hesitation in recommending this to anyone trying to get a job as a Power BI developer.

Power BI Refresh, Memory Usage And Semantic Model Scale Out

In my recent posts on the Command Memory Limit error and the partialBatch mode for Power BI semantic model refresh, I mentioned that one way to avoid memory errors when refreshing large semantic models was to run a refresh of type clearValues followed by a full refresh – but that the downside of doing this was that your model would not be queryable until the full refresh had completed. Immediately afterwards some of my colleagues (thank you Alex and Akshai) pointed out that there was in fact a way to ensure a model remained queryable while using this technique: using Semantic Model Scale Out. How? Let me explain…

Problem Recap

As I described here, when you run a full refresh on a semantic model that already has data loaded into it, the original version of the model stays in memory while the new data is loaded in, which means that the overall amount of memory required for the refresh is around double the amount needed to store the model. Since there are limits on the amount of memory that a model can use, which vary depending on whether you’re using a capacity or not and what size capacity you’re using, if you have a large model then you might find refresh fails with a memory error. Running a refresh of type clearValues first removes all the data from the model, reducing its memory usage and improving the chances of a full refresh using less than the allowed maximum amount of memory and therefore succeeding. Removing all the data from your model, though, means that your reports won’t show any data until you do a full refresh.

How can Semantic Model Scale Out help?

Semantic Model Scale Out (also known as Query Scale Out) is primarily a performance feature: by creating multiple replicas of your semantic model, one which is used for refreshes and one or more that are used to answer queries, it prevents refreshes from interfering with report performance and allows report queries to be distributed over multiple physical nodes. All of these replicas of the semantic model need to be kept in sync, which means that when the model that is used for refresh has new data loaded into it, all the other versions of the model need to be replaced by this new version. By default this synchronisation happens automatically but you can also turn this off and control the synchronisation manually.

As a result, if you turn on Semantic Model Scale Out and turn off automatic synchronisation, you can run a refresh of type clearValues to clear the data from your model and then run a full refresh without affecting your end users – who won’t see the new data until you do a manual synchronisation.

Example

Let’s see a simple illustration of how this works. Using the same semantic model I used in this post, hosted on an F64 capacity, I created a report with a single card visual:

With Scale Out turned off, I ran a refresh of type clearValues. This was very fast and after it had finished – as you would expect – the card visual showed a blank value because all of the data had been cleared from the model:

I then ran a full refresh and when it completed the card showed data again.

Next, I turned on Scale Out in the settings pane of the semantic model:

…and turned off automatic replica synchronisation using the script here.
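If you’d rather not use PowerShell, you can also change this setting with the Power BI REST API’s Update Dataset In Group call, which accepts a queryScaleOutSettings property. Here’s a minimal Python sketch of that approach – the workspace ID, semantic model ID and access token are placeholders you’d need to supply yourself:

import requests
# placeholders - replace with your own workspace (group) and semantic model IDs
workspace_id = "00000000-0000-0000-0000-000000000000"
dataset_id = "00000000-0000-0000-0000-000000000000"
access_token = "<access token acquired via MSAL or similar>"
# turn off automatic replica synchronisation for Semantic Model Scale Out
url = f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}"
body = {"queryScaleOutSettings": {"autoSyncReadOnlyReplicas": False}}
response = requests.patch(url, json=body, headers={"Authorization": f"Bearer {access_token}"})
response.raise_for_status()  # 200 OK means the setting was updated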

My colleague Michael Kovalsky has very kindly added functionality to Semantic Link Labs to handle manual replica synchronisation, which means it was extremely simple to run the refreshes and synchronise replicas from a Fabric notebook. After installing Semantic Link Labs:

%pip install semantic-link-labs

…I used the following Python code in a notebook cell to run a refresh of type clearValues, followed by a full refresh, followed by a manual replica synchronisation:

import sempy_labs as labs
WorkspaceName = "Model Memory Tests"
SemanticModelName = "ModelMemoryDemo3"
# run a refresh of type clearValues first
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="clearValues")
# then a refresh of type full
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full")
# then a manual replica sync
labs.qso_sync(dataset=SemanticModelName, workspace=WorkspaceName)

And as expected the card visual shown above showed data all through the refreshes.
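As an aside, if you want to check how up to date the read-only replicas are from the same notebook, Semantic Link Labs also has a qso_sync_status function. Here’s a minimal sketch – note that the exact return format of this function is an assumption on my part, so check the Semantic Link Labs documentation for the version you’re using:

import sempy_labs as labs
WorkspaceName = "Model Memory Tests"
SemanticModelName = "ModelMemoryDemo3"
# check how far the read-only replicas are behind the read-write replica;
# the return format may vary between versions of the library
sync_status = labs.qso_sync_status(dataset=SemanticModelName, workspace=WorkspaceName)
print(sync_status)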

SQLBits Power BI/Fabric Session Recordings Now Available

One of the great things about the SQLBits conference in the UK is the way that all the session recordings are made available for anyone to view for free on their YouTube channel. All the recordings from the 2024 conference have now been posted and it’s a goldmine for anyone who wants to deepen their Power BI or Fabric knowledge.

Here are the recordings for my three sessions:

Fabric Direct Lake Deep Dive

What’s new in Power Query in Power BI, Fabric and Excel?

Connect Excel to Power BI data with Live Connected Tables

There are so many other great sessions that it’s hard to pick out any highlights, but there are two that you might easily miss which I think are ones you should watch.

First, this session on DAX fusion by my colleague Phil Seamark is one I learned a lot from. It sounds like an obscure topic but it’s extremely important for anyone trying to optimise their measures (including for DirectQuery models):

Second, this session by another colleague of mine, Kasper de Jonge, on how security works with Fabric – not just data security but security of the platform as a whole – is a must-watch because so many of the assumptions people have are not relevant for a SaaS platform like Fabric:

Why Power BI Table Visuals With Scrollbars Can Cause Problems

Concluding my series of blog posts on seemingly harmless things you can do in a Power BI report that can lead to performance problems and the “This visual has exceeded the available resources” error (see also “Calculate(), Filter() and DAX memory usage” and “DAX measures that never return blank“), in this post I’ll show how table visuals with vertical scrollbars that potentially show thousands of rows can be a Bad Thing.

Using the Import mode semantic model I’ve used in the last two posts, consider this table visual:

The two measures displayed are trivial counts and averages. However there are 538753 combinations of County and Postcode that have data, and so in theory the table can display 538753 rows if you scroll down far enough. That can’t be good, can it?

Those of you who know how Power BI generates DAX queries for table visuals like this might disagree, though. Here’s the DAX query for this table visual:

DEFINE
	VAR __DS0Core = 
		SUMMARIZECOLUMNS(
			'Property Transactions'[County],
			'Property Transactions'[Postcode],
			"Count_Of_Sales", 'Property Transactions'[Count Of Sales],
			"Average_Price_Paid", 'Property Transactions'[Average Price Paid]
		)

	VAR __DS0PrimaryWindowed = 
		TOPN(501, __DS0Core, 'Property Transactions'[County], 1, 'Property Transactions'[Postcode], 1)

EVALUATE
	__DS0PrimaryWindowed

ORDER BY
	'Property Transactions'[County], 'Property Transactions'[Postcode]

As you can see, the query contains a TOPN filter which means that it actually only returns the first 501 rows out of those potential 538753. It’s only if you scroll down far enough that another query is triggered to get the next 501 rows. So maybe it’s not so bad?

Actually it can be bad, even with the TOPN filter. Here are the Execution Metrics for the query above:

{
	"timeStart": "2024-07-10T22:54:30.330Z",
	"timeEnd": "2024-07-10T22:54:31.064Z",

	"durationMs": 734,
	"vertipaqJobCpuTimeMs": 219,
	"queryProcessingCpuTimeMs": 516,
	"totalCpuTimeMs": 734,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 66599,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 501
}

Notice the durationMs and approximatePeakMemConsumptionKB metrics, which tell you how long the query takes to run and how much memory it uses.
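As an aside, if you want to experiment with queries like this yourself, you can copy the DAX query for a visual from Performance Analyzer and run it from a Fabric notebook using Semantic Link’s evaluate_dax function. Here’s a minimal sketch – the semantic model name used here is a placeholder, and remember that the memory metrics themselves still have to come from the Execution Metrics event in Profiler or Log Analytics:

import sempy.fabric as fabric
# the DAX query for the table visual, copied from Performance Analyzer
dax_query = """
DEFINE
    VAR __DS0Core =
        SUMMARIZECOLUMNS(
            'Property Transactions'[County],
            'Property Transactions'[Postcode],
            "Count_Of_Sales", 'Property Transactions'[Count Of Sales],
            "Average_Price_Paid", 'Property Transactions'[Average Price Paid]
        )
    VAR __DS0PrimaryWindowed =
        TOPN(501, __DS0Core, 'Property Transactions'[County], 1, 'Property Transactions'[Postcode], 1)
EVALUATE
    __DS0PrimaryWindowed
ORDER BY
    'Property Transactions'[County], 'Property Transactions'[Postcode]
"""
# "Property Transactions Demo" is a placeholder model name - replace with your own
df = fabric.evaluate_dax(dataset="Property Transactions Demo", dax_string=dax_query)
print(len(df))  # 501 rows, thanks to the TOPN window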

Now consider the following version of a report with a slicer for the County column and the county “Bath and North East Somerset” selected.

Here’s the DAX query for the table visual now:

DEFINE
	VAR __DS0FilterTable = 
		TREATAS({"BATH AND NORTH EAST SOMERSET"}, 'Property Transactions'[County])

	VAR __DS0Core = 
		SUMMARIZECOLUMNS(
			'Property Transactions'[County],
			'Property Transactions'[Postcode],
			__DS0FilterTable,
			"Count_Of_Sales", 'Property Transactions'[Count Of Sales],
			"Average_Price_Paid", 'Property Transactions'[Average Price Paid]
		)

	VAR __DS0PrimaryWindowed = 
		TOPN(501, __DS0Core, 'Property Transactions'[County], 1, 'Property Transactions'[Postcode], 1)

EVALUATE
	__DS0PrimaryWindowed

ORDER BY
	'Property Transactions'[County], 'Property Transactions'[Postcode]

As you would expect, this query now contains a filter on the county selected in the slicer. It also contains a TOPN filter but the interesting thing is that, since there are more than 501 postcodes in the selected county, the query returns exactly the same rows as before. Of course you can’t now scroll down and see data for counties other than the one selected in the slicer, but you can always select a different county in the slicer.

Here are the Execution Metrics for the new query:

{
	"timeStart": "2024-07-10T22:58:29.725Z",
	"timeEnd": "2024-07-10T22:58:29.741Z",

	"durationMs": 16,
	"vertipaqJobCpuTimeMs": 0,
	"queryProcessingCpuTimeMs": 16,
	"totalCpuTimeMs": 16,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 2564,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 501
}

Notice that the duration and peak memory usage of the query are both much, much lower: duration has gone from 734ms to 16ms, while peak memory usage has gone from 66599KB to 2564KB. If you have more measures, or measures which are more complex and/or inefficiently written, the difference could be much bigger.

Why is there such a big difference? Well, if you look at the physical query plan generated for both queries you can see that before the TOPN filter is applied, the Vertipaq engine still has to spool the entire, unfiltered table. In the first query this unfiltered table has 538753 rows, which is why it’s relatively slow and memory hungry, whereas in the second query this table, with the County slicer applied, has only 1869 rows (the number of postcodes in the selected county). As a result slicing on County, even though it doesn’t affect which rows are returned after the TOPN filter is applied, leads to a more efficient query. If you’re using DirectQuery mode something similar happens because, as I noted here, the TOPN filter can only be pushed back to the relational source in a few basic scenarios, which means the SQL queries generated can return a very large number of rows and can hit the Max Intermediate Row Set Count limit.

What can you learn from this? Table visuals that can display a large number of rows should be avoided. I’m no report design guru but I would go so far as to say that any table with a scrollbar on it is a mistake: if your users need to scroll down to see all the rows or scroll right to see all the columns, why not use a slicer to allow them to select the rows they want to see more easily, or use field parameters to allow them to select the columns they want to see? Not only will this be better from a usability point of view, performance will be a lot better and memory usage will be lower. And yes, I know that your end users ask for tables with lots of rows and columns because that’s what they are used to seeing in Excel, but giving them what they want risks slow performance and memory errors. So they need to make a choice.

DAX Measures That Never Return Blank

Following on from my earlier post on the Query Memory Limit in Power BI, and as companion to last week’s post on how a DAX antipattern using Calculate() and Filter() can lead to excessive memory consumption by queries (and therefore lead to you hitting the Query Memory Limit), in this post I want to look at the effects of another DAX antipattern on performance and memory usage: measures that can never return a blank value.

Adding zero to the result returned by a measure

What do I mean by a measure that never returns blank? Well using the same model that I used in last week’s post, let’s say I have a simple measure that counts the rows in my fact table:

Count Of Sales = COUNTROWS('Property Transactions')

I can use this measure in a matrix visual with Property Type on columns and Postcode on rows, like so:

Here’s what Execution Metrics says about the DAX query for this visual:

{
	"timeStart": "2024-07-03T10:16:16.251Z",
	"timeEnd": "2024-07-03T10:16:19.345Z",

	"durationMs": 3094,
	"vertipaqJobCpuTimeMs": 188,
	"queryProcessingCpuTimeMs": 2656,
	"totalCpuTimeMs": 2844,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 179404,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 230
}

And here’s what DAX Studio’s Server Timings shows:

It’s not fast (there are a lot of Postcodes even though the query has a TOPN filter and only returns 230 rows – that’s a subject for another blog post though) but the thing to look at is the peak memory consumption, which is 179404KB or 175MB, and the fact that there is just one Storage Engine query.

Now let’s say your boss says she hates all that white space in the matrix and wants to see zeroes instead. The obvious way to fix this is to add a zero to the result of the count in the measure, like so:

Count Of Sales = COUNTROWS('Property Transactions')+0

It’s a minor change so what could go wrong? Here’s what Execution Metrics shows after making that change:

{
	"timeStart": "2024-07-03T10:22:01.343Z",
	"timeEnd": "2024-07-03T10:22:10.390Z",

	"durationMs": 9047,
	"vertipaqJobCpuTimeMs": 313,
	"queryProcessingCpuTimeMs": 8578,
	"totalCpuTimeMs": 8891,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 379432,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 612
}

And here are the Server Timings:

The query is now three times slower, has three Storage Engine queries, and uses 370MB of memory – twice as much as before! This increase in memory can easily push you over the Query Memory Limit and result in the “This visual has exceeded the available resources” error, as this real-life example shows.

Why does this change make such a big difference? Adding zero to the result of the count results in the creation of a “dense” measure, that’s to say a measure that never returns a blank value. The white space in the first visual was the result of the measure not returning any values, whereas in the second visual the measure is returning values for every cell in the matrix. What is a blank value? I strongly recommend you read Marco’s article here for a detailed introduction to this subject; blanks are somewhat similar to nulls in SQL but there are some very important differences, and in this case the thing to point out is that the expression BLANK()+0 returns 0 in DAX. As a result a measure which used to return a blank value in some cases now never returns a blank. This can be extremely bad for DAX performance and memory usage because the Vertipaq engine inside Power BI can do a lot of optimisations when it knows a measure will return blank in some circumstances. Nikola Ilic has a great post here going into more detail about this.
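If you want to see this behaviour for yourself, you can run a trivial DAX query from a Fabric notebook with Semantic Link – a minimal sketch, where the model name is a placeholder (any model will do since the query doesn’t touch any tables):

import sempy.fabric as fabric
# BLANK()+0 evaluates to 0, which is what makes "+0" measures dense
df = fabric.evaluate_dax(
    dataset="ModelMemoryDemo3",  # placeholder - any semantic model will do
    dax_string="EVALUATE { BLANK() + 0 }"
)
print(df)  # a single row containing 0, not blank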

Using FORMAT() in measures

There are a few other common ways you can run into the same problem. One is the use of the FORMAT() function to return a formatted string from a measure, for example:

Count Of Sales = FORMAT(COUNTROWS('Property Transactions'),"Standard")

Here’s what Execution Metrics shows for the visual above with the new version of the measure:

{
	"timeStart": "2024-07-04T09:48:43.655Z",
	"timeEnd": "2024-07-04T09:48:53.499Z",

	"durationMs": 9844,
	"vertipaqJobCpuTimeMs": 281,
	"queryProcessingCpuTimeMs": 9359,
	"totalCpuTimeMs": 9656,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 380480,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 612
}

As you can see, the duration and memory usage are very similar to the version of the measure above that added a zero to the result of COUNTROWS(): slow and memory hungry.

The FORMAT() function returns a text value that represents a formatted version of the value passed in; blank values are converted to empty strings, so again this new version of the measure can never return a blank value. My recommendation is to always try to avoid using FORMAT() in measures because of the impact it can have on performance. Instead, try to use format strings (for more examples see my posts here and here) or dynamic format strings (you should also read the SQLBI article on dynamic format strings to be aware of the implications of using them) wherever possible. It’s a shame that format strings don’t have built-in support for formatting blank values; I’m told there are some good technical reasons why this hasn’t happened yet.

IFERROR() and DIVIDE()

Some other DAX functions can also trigger this performance issue. The IFERROR() and ISERROR() functions are one example, since they allow you to trap errors in DAX expressions and return a value instead. However I can say quite confidently that you should never, ever use IFERROR() or ISERROR() when writing DAX anyway, and there is some guidance documentation to back me up on this.

Another example which is less obviously bad is the use of the third parameter of the DIVIDE() function. As you probably know, the DIVIDE() function allows you to handle division by zero and division by blank safely, and by default it will return a blank value when division by zero or division by blank is detected. However there is an optional third parameter which allows you to return a value instead of a blank when dividing by zero or blank, and this can also trigger the problem we’re looking at.

Consider the following measure:

Division Test = 
DIVIDE(
    1,
    COUNTROWS('Property Transactions')
)

Here it is used in our visual:

Here are the Execution Metrics:

{
	"timeStart": "2024-07-04T11:42:29.636Z",
	"timeEnd": "2024-07-04T11:42:32.386Z",

	"durationMs": 2750,
	"vertipaqJobCpuTimeMs": 234,
	"queryProcessingCpuTimeMs": 2516,
	"totalCpuTimeMs": 2750,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 210960,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 230
}

Now, here’s the same measure with the third parameter of DIVIDE() set:

Division Test = 
DIVIDE(
    1,
    COUNTROWS('Property Transactions'),
    0
)

And here are the Execution Metrics:

{
	"timeStart": "2024-07-04T11:44:24.997Z",
	"timeEnd": "2024-07-04T11:44:32.247Z",

	"durationMs": 7250,
	"vertipaqJobCpuTimeMs": 359,
	"queryProcessingCpuTimeMs": 6891,
	"totalCpuTimeMs": 7250,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 379436,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 612
}

Once again, the second version of the measure that never returns a blank is slower and uses more memory.

Summary

As you have seen, Power BI measures that can never return a blank value can be very bad for performance and query memory usage, so when you are writing DAX you should always be careful to avoid creating them.

Calculate(), Filter() And DAX Memory Usage

Following on from my last post on the Query Memory Limit in Power BI, in this post I want to look at one of the most common DAX antipatterns and its effect on query memory usage: filtering on a whole table, rather than an individual column, in the filter parameters of the Calculate() function. A lot has already been written on this particular antipattern from the point of view of query performance – for example see here and here – but it’s only with the recent addition of the Execution Metrics Profiler/Log Analytics event that you can see how bad it is for memory usage too.

Here’s a simple example. Using a semantic model based on the UK Land Registry’s Price Paid data, with two dimension tables (Date and Property Type) and a fact table called Property Transactions with just over a million rows:

…let’s say that you write a measure that counts the number of property transactions for detached properties, ie counting the rows in the Property Transactions table where the Property Type column contains the value “D”. Here’s a first attempt at that:

Detached Count V1 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    FILTER ( 'Property Transactions', 'Property Transactions'[Property Type] = "D" )
)

Notice that this measure follows the antipattern of filtering on the whole Property Transactions table with the Filter() function just to filter on the Property Type column.

Here’s a DAX query that shows the results of this measure for all Postcodes. It returns around 153000 rows to show the memory impact of the way the measure is written.

EVALUATE
SUMMARIZECOLUMNS(
    'Property Transactions'[Postcode],
    "Detached Count", [Detached Count V1]
)

Here’s what the Execution Metrics event returns for this query:

{
	"timeStart": "2024-06-28T15:41:59.951Z",
	"timeEnd": "2024-06-28T15:42:00.388Z",

	"durationMs": 438,
	"vertipaqJobCpuTimeMs": 47,
	"queryProcessingCpuTimeMs": 391,
	"totalCpuTimeMs": 438,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 22246,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 152583
}

In this case the memory usage of the query, as given by the approximatePeakMemConsumptionKB metric, is 22246KB or about 22MB.

Now consider this version of the measure:

Detached Count V2 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    'Property Transactions'[Property Type] = "D"
)

Running the same DAX query as before with this new measure, you get the same results but with a much lower memory usage of 8734KB or 9MB:

{
	"timeStart": "2024-06-28T15:56:09.431Z",
	"timeEnd": "2024-06-28T15:56:09.602Z",

	"durationMs": 172,
	"vertipaqJobCpuTimeMs": 31,
	"queryProcessingCpuTimeMs": 125,
	"totalCpuTimeMs": 172,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 8734,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 152583
}

Query duration and CPU usage are reduced too, but as I said above this highlights how different DAX patterns can result in very different memory footprints. Filtering on just the column you want to filter, rather than the whole table, is a lot more memory efficient! If you’re getting the “This visual has exceeded the available resources” error because you’ve hit the Query Memory Limit then it’s worth checking to see if you have used this antipattern in your measures and rewriting accordingly.
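If you want a quick way to compare patterns like this from a Fabric notebook without setting up a Profiler trace, you can time both versions of the query with Semantic Link – a minimal sketch, where the model name is a placeholder; the timings will include some client overhead and, as noted above, only the Execution Metrics event gives you the memory numbers:

import time
import sempy.fabric as fabric
QUERY_TEMPLATE = """
EVALUATE
SUMMARIZECOLUMNS(
    'Property Transactions'[Postcode],
    "Detached Count", [{measure}]
)
"""
# "Property Transactions Demo" is a placeholder model name - replace with your own
for measure in ["Detached Count V1", "Detached Count V2"]:
    start = time.perf_counter()
    fabric.evaluate_dax(dataset="Property Transactions Demo", dax_string=QUERY_TEMPLATE.format(measure=measure))
    print(f"{measure}: {time.perf_counter() - start:.3f}s")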

Footnote: it’s worth mentioning that two other variations on the rewritten measure, one using the Property Type column on the Property Type dimension table instead of the Property Type column on the fact table:

Detached Count V3 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    'Property Type'[Property Type] = "D"
)

…and one that added the KeepFilters() function (which is commonly used in this type of measure):

Detached Count V4 =
CALCULATE (
    COUNTROWS ( 'Property Transactions' ),
    KEEPFILTERS ( 'Property Type'[Property Type] = "D" )
)

…both had the same memory usage as the V2 version of the measure, 9MB.

Power BI Semantic Model Memory Errors, Part 4: The Query Memory Limit

Continuing my series on Power BI memory errors (see part 1, part 2 and part 3), in this post I’ll look at the query memory limit which controls the amount of memory that an individual query can consume. This is a subject which I’ve talked about in a few blog posts in the past: hitting this limit is one of the reasons you’ll see the “This visual has exceeded the available resources” error which I blogged about here and here. There’s also some official documentation here which is pretty good.

As I mentioned in part 1 of this series, when you run a DAX query against a Power BI semantic model it will use memory. How much memory it uses will depend on your data, how you have modelled it, what data your query is requesting and how you have written any DAX calculations used by your query. For obvious reasons we at Microsoft can’t let queries use an infinite amount of memory in the Power BI Service so there is a limit – the query memory limit – on how much memory an individual query can use. This limit varies depending on whether you are using Shared (also known as Power BI Pro) or Premium/Fabric capacity, and by the size/SKU of the capacity you’re using. The different limits for each SKU are documented here.

While you can’t increase the query memory limit over the published thresholds, capacity admins can reduce the query memory limit in the admin portal by setting the “Query Memory Limit %” property on the capacity.

The default value of this property is 0, which means the maximum value for the capacity SKU you’re using will be applied; any other value will be interpreted as a percentage of that maximum value. Reducing this value can be a good way to stop your developers from using inefficient DAX in their reports and consuming too many resources on your capacity – it won’t directly reduce a query’s CU usage but DAX expressions that are very memory-intensive often use a lot of CUs too.
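To make the arithmetic concrete, here’s a trivial sketch of how the effective limit is derived from this property – the 10GB figure used for illustration is the documented query memory limit for a P1/F64 at the time of writing, but check the documentation for your own SKU:

def effective_query_memory_limit_gb(sku_max_gb, query_memory_limit_pct):
    # 0 means use the SKU maximum; any other value is a percentage of it
    if query_memory_limit_pct == 0:
        return sku_max_gb
    return sku_max_gb * query_memory_limit_pct / 100
print(effective_query_memory_limit_gb(10, 0))   # 10.0 - the default
print(effective_query_memory_limit_gb(10, 50))  # 5.0 - limit halved by an admin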

If the DAX query generated by a visual in a Power BI report hits the query memory limit you’ll see an error in your report. The first reaction many Power BI developers have when they hit the limit is to ask if there is a way to increase it – which you can only do by buying a larger capacity – but to be brutally honest, if you are hitting the query memory limit you have done something wrong and you need to fix your model, your DAX calculations or your report to reduce memory usage. Doing this is likely to improve report performance too. If you’re encountering the error in Power BI Desktop you can raise the limit but this is only possible in Desktop and if you do this you may end up getting the error again after you publish.

Here’s the error you’ll see in Power BI Desktop if you hit the query memory limit:

This visual has exceeded the available resources. Try filtering to decrease the amount of data displayed.

If you hit the query memory limit in a published report you’ll see this error message:

In this case you not only get the “Visual has exceeded the available resources” error but going to the details shows a more helpful message that tells you what the current query memory limit is:

Resource Governing: This query uses more memory than the configured limit. The query — or calculations referenced by it — might be too memory-intensive to run. Either simplify the query or its calculations, or if using Power BI Premium, you may reach out to your capacity administrator to see if they can increase the per-query memory limit. More details: consumed memory 1 MB, memory limit 1 MB. See https://go.microsoft.com/fwlink/?linkid=2159752 to learn more.

The error number associated with this error is 0xC13E0004 (-1052901372).

How do you know how much memory your query is using, to check how near you are to the query memory limit? The new Execution Metrics event in Profiler and Log Analytics will tell you. If you find the Execution Metrics event associated with the Query End event for the query (something I discussed here), the approximatePeakMemConsumptionKB metric gives the peak memory usage of the query reached while it was running, and it is this value that must not exceed the Query Memory Limit.

{
	"timeStart": "2024-06-21T16:24:16.608Z",
	"timeEnd": "2024-06-21T16:24:18.326Z",

	"durationMs": 1719,
	"vertipaqJobCpuTimeMs": 1078,
	"queryProcessingCpuTimeMs": 531,
	"totalCpuTimeMs": 1609,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 46662,

	"commandType": 27,
	"queryDialect": 3,
	"queryResultRows": 502
}
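Since the event payload is just JSON, it’s also easy to extract this value programmatically once you’ve captured the event. Here’s a minimal Python sketch, assuming you’ve saved the TextData of an Execution Metrics event to a string (abbreviated here):

import json
# the TextData payload of an Execution Metrics event, abbreviated for brevity
text_data = """
{
	"durationMs": 1719,
	"approximatePeakMemConsumptionKB": 46662,
	"queryResultRows": 502
}
"""
metrics = json.loads(text_data)
peak_mb = metrics["approximatePeakMemConsumptionKB"] / 1024
print(f"Peak query memory: {peak_mb:.0f}MB")  # ~46MB for the query above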

If you can’t use Profiler or Log Analytics there’s no direct way of knowing how much memory a query is using. You can however set a custom memory limit in Power BI Desktop and keep changing it until you see, or don’t see, the query memory limit error, and therefore estimate how much memory your query is using.

How can you reduce the amount of memory that your queries use? That’s a question that’s too big to answer in any one blog post and this one is already long enough. In my next few posts I will look at some common causes of excessive memory usage and what you can do to fix them.

The “DataFormat.Error: File contains corrupted data” Error In Power Query

When you’re using Excel workbooks as a data source in Power Query in either Excel or Power BI you may encounter the following error when trying to connect:

DataFormat.Error: File contains corrupted data.

There are three possible causes of this error that I know of, two of which are well documented. First, the file may actually be corrupt in which case I hope you have a backup! Second, the Excel file may be password protected. Power Query cannot connect to a password protected Excel file so you’ll need to remove the password protection before you can use it as a source.

The third is one I ran into this week and it confused me a lot. It turns out that if the Excel file has a sensitivity label applied to it that results in it being encrypted, Power Query in Excel will not be able to connect to it – although Power Query in Power BI will be able to. When connecting to an encrypted Excel file, Power Query in Power BI is able to decrypt the file using the credentials of the logged-in user and read it, but (at least at the time of writing) Power Query in Excel cannot do this and so you get the error above. In my case I had an Excel workbook that used Power Query to connect to a few other Excel workbooks, load data from them, and do some tax calculations. It was working a few months ago but when I reopened it this week I got the “DataFormat.Error” error and I couldn’t work out why. It turns out that in the meantime an admin at Microsoft had applied a company-wide policy that meant all workbooks stored in OneDrive for Business had a highly restrictive sensitivity label applied automatically – which means my Power Query queries stopped working. As soon as I changed the sensitivity label on my source workbooks to “Non Business” so they weren’t encrypted, everything worked again.

[Thanks to Curt Hagenlocher for this information]

Get Better Results From Power BI Copilot With Linguistic Modelling

Everyone is excited about Power BI Copilot, and the newly-announced preview of being able to use Copilot to ask questions about all the data in your semantic model rather than just what is shown in a report is a massive step forward. However amazing the LLMs used behind the scenes are, though, the quality of the results your users get from this new Copilot preview depends on a number of factors that you as a developer control. How you model your data is very important and as the announcement blog mentions, so is the linguistic schema that was originally added for Copilot’s predecessor feature, Q&A. Copilot returns much better results than Q&A ever did but the combination of Copilot and the extra information that Q&A’s linguistic schema provides (information Copilot could not know unless you told it) makes Copilot even more powerful. What’s more, you don’t need to edit a YAML file to use this functionality any more because most of the features of the linguistic schema are now available to edit in Power BI Desktop’s Q&A Setup dialog.

In this blog post I’ll show you a few examples of how adding to a model’s linguistic schema improves the new Power BI Copilot preview’s results when you’re querying your semantic model.

Semantic Model

Let’s say you own a farm where you grow fruit. Customers visit the farm to buy fruit and the fruit is picked for these customers by your employees. You store your sales data in a Power BI semantic model that looks like this:

The Orders table is a fact table with one row for each order line. The Order ID column identifies the order that each line item is linked to, the Amount column contains the sales value and the Units column contains the number of fruit delivered for each line. The Employee dimension gives the name of the employee who picked the fruit; the Customer dimension gives the name of the customer who ordered the fruit; the Product dimension gives the name of the fruit picked. Products are associated with Product Groups via a many-to-many relationship.

Here’s the data in the dimension tables:

To get the best results from Copilot note that:

  • The data is modelled as a classic star schema.
  • The table and column names are in human-readable English with no abbreviations and with spaces between the words. I talked about my opinions on Power BI naming conventions in this blog post.
  • All tables and columns that should not be shown in a report have been hidden.
  • The fact table measure columns have been hidden and three explicit measures – Order Amount, Order Units and Order Count (which is a distinct count on the Order ID column) – have been created.

Synonyms

While Copilot performs well on this model, let’s look at a simple question where it doesn’t return the expected result:

Show the number of orders by employee

The prompt returns a visual, but on closer inspection it’s not the result you want. It shows the count of rows in the Orders table which is the number of line items, not a count of orders:

To get the correct result you need to tell Copilot that the Order Count measure returns the number of orders by defining a Synonym. You can do this in the Q&A setup dialog in Power BI Desktop on the Synonyms tab:

Setting “number of orders” as a synonym for the Order Count measure means that the prompt now returns the following visual with the results you want:

Verbs

The next prompt to look at is:

Who picked lemons?

You know that on your farm it’s the employees who pick the fruit, but there’s nothing in the model to tell Copilot that. As a result the prompt above results in Copilot saying that it doesn’t know what “picked” means in this context:

On the relationships tab of the Q&A Setup dialog you can fix this by defining a Verb relationship:

The relationship tells Copilot that “Employee names” pick “Product names” with the Orders table connecting the two columns.

With this relationship in place, Copilot correctly answers that Gabi was the only employee who picked lemons:

Nouns

The customer Chris is also widely referred to by employees as “Mr Webb”, but that name isn’t stored anywhere in the model. As a result the prompt

How much money did we make from Mr Webb?

results in the following, fairly reasonable, response:

However with a noun relationship set up to tell Copilot that “Mr Webb” is a kind of customer name where customer name equals “Chris”:

Then the result is what you would expect:

Dynamic Nouns

Copilot does a good job with the many-to-many relationship between Product and Product Group without any optimisation. For example the prompt:

show all citrus fruit and their order amounts

…returns the correct result:

But let’s say that in this case you want to show the individual products rather than the product group “citrus fruit”. You can do this by setting up a dynamic noun relationship:

The relationship is that “Product group names” define kinds of “product names”, with the Product To Product Group table linking the two. With this in place the prompt now returns the desired result:

Conclusion

These examples barely scratch the surface of what’s possible with the linguistic schema and Copilot. Apart from the documentation, I found the videos on the (fairly old) “Natural Language for Power BI” YouTube channel, which were created when Q&A was launched, useful for understanding the concepts here too. There’s a lot to learn here but with some trial and error, as well as listening to feedback from your end users, you should be able to tune Copilot so it returns high quality results almost all the time.

Power BI Semantic Model Memory Errors, Part 3: The Command Memory Limit

Continuing my series on Power BI model memory errors (see part 1 and part 2), in this post I will look at the Command Memory Limit which restricts the amount of memory that XMLA commands like Create, Alter and most importantly Refresh can use.

If you’ve ever been told that your semantic model should consume less than half the amount of memory available to it because memory consumption can double during a full refresh, then that is because of the Command Memory Limit. Every time a model is refreshed in Power BI, that refresh is initiated by running a Refresh command. During the refresh a copy of the model is created in the background and it is the copy that is refreshed; when the refresh is completed, Power BI deletes the original version of the model and replaces it with the copy. While the refresh is in progress, the memory consumed by this copy of the model and all the operations needed to load data into it (including any Power Query queries used to get data from your data sources and to transform that data) is associated with the Refresh command. The Command Memory Limit specifies how much memory the Refresh command is allowed to use.

The good news is that there is an excellent, detailed explanation of the Command Memory Limit in the docs here which I recommend you read before continuing:

https://learn.microsoft.com/en-gb/power-bi/enterprise/troubleshoot-xmla-endpoint#resource-governing-command-memory-limit-in-premium

What it says is that the amount of memory that XMLA commands like Refresh can use is the maximum allowed size for a semantic model for the capacity you’re using (as documented in the table here in the Max Memory column) minus the amount of memory the semantic model is using when the command starts.

Let’s look at an example of an error caused by exceeding the Command Memory Limit.

I created an Import mode semantic model whose total size was 3.3GB, and which consisted of a single partitioned table with 20 columns, each of which contained random decimal numbers. I refreshed this model on an F64 capacity where the maximum allowed memory per model is 25GB and the refresh succeeded.

How much memory did the Refresh command use? There are two ways to find out. For a while now you have been able to get the approximate peak memory usage during a refresh (along with the approximate peak memory usage just for Power Query queries, which is a subset of this figure) from the Command End event associated with that refresh in Profiler and Log Analytics; I blogged about this here. However the new Execution Metrics event in Profiler and Log Analytics makes the same information even easier to extract. Here’s what the Execution Metrics event for the refresh looked like in this case:

{
	"timeStart": "2024-05-26T17:18:29.984Z",
	"timeEnd": "2024-05-26T17:26:06.577Z",

	"durationMs": 456593,
	"directQueryExecutionTimeMs": 6068,
	"vertipaqJobCpuTimeMs": 448938,
	"mEngineCpuTimeMs": 362094,
	"totalCpuTimeMs": 1565781,
	"executionDelayMs": 370,

	"approximatePeakMemConsumptionKB": 6074967,
	"mEnginePeakMemoryKB": 530688,

	"tabularConnectionTimeoutMs": 18000000,

	"commandType": 2,
	"refreshParallelism": 6,
	"vertipaqTotalRows": 15000000,
	"qsoReplicaVersion": 133612179663049757,
	"intendedUsage": 2
}

[Note: if you’re looking in Profiler or Log Analytics you’ll see a lot of Execution Metrics events. Typically the Execution Metrics events for a refresh will be generated immediately after the Command End event for that refresh, but to be sure you should look for matching values in the RequestId column (in a Profiler trace) or the XmlaRequestId column (in Log Analytics) to associate an Execution Metrics event with a Command End event.]
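If you’ve loaded your trace events into a pandas DataFrame (for example from a saved Profiler trace or the results of a Log Analytics query), pairing the two event types up is a simple join. Here’s a minimal sketch – the EventClass, RequestId and event class names used here are assumptions that will depend on how you exported the trace:

import pandas as pd
def match_execution_metrics(trace_events: pd.DataFrame) -> pd.DataFrame:
    # column and event class names are assumptions - adjust for your export format
    command_end = trace_events[trace_events["EventClass"] == "Command End"]
    exec_metrics = trace_events[trace_events["EventClass"] == "Execution Metrics"]
    # join on RequestId (XmlaRequestId in Log Analytics) to pair each
    # command with its Execution Metrics payload
    return command_end.merge(exec_metrics, on="RequestId", suffixes=("_command", "_metrics"))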

The metric to look at in the Execution Metrics output above is approximatePeakMemConsumptionKB and it shows the refresh used about 6,074,967KB or 5.8GB at its peak – which, on top of the 3.3GB used by the existing copy of the model, is a lot more than double the size of the model before or after the refresh. I designed the model specifically for this to happen (20 columns of random decimal numbers is not easy to compress) and in most cases the memory usage will be lower relative to the size of the model.

I then scaled the capacity down to an F16 which only has a memory limit of 5GB for semantic models and refreshed again. As you would expect, the refresh failed with the following error:

Resource Governing: This operation was canceled because there wasn’t enough memory to finish running it. Either reduce the memory footprint of your dataset by doing things such as limiting the amount of imported data, or if using Power BI Premium, increase the memory of the Premium capacity where this dataset is hosted. More details: consumed memory 1780 MB, memory limit 1779 MB, database size before command execution 3340 MB. See https://go.microsoft.com/fwlink/?linkid=2159753 to learn more. The current operation was cancelled because another operation in the transaction failed.

This is the error message associated with hitting the Command Memory Limit (the associated error number is -1052901373). What this message is saying is that the model was consuming 3340MB (3.3GB) before the refresh started, then the refresh itself reached a maximum of 1780MB (1.7GB) but at that point it was cancelled because 5GB-3.3GB=1.7GB. Here’s the Execution Metrics data for the refresh Command:

{
	"timeStart": "2024-05-27T10:11:53.985Z",
	"timeEnd": "2024-05-27T10:12:56.834Z",

	"durationMs": 62849,
	"directQueryExecutionTimeMs": 3167,
	"mEngineCpuTimeMs": 29813,
	"totalCpuTimeMs": 80375,
	"executionDelayMs": 0,

	"approximatePeakMemConsumptionKB": 1821523,
	"mEnginePeakMemoryKB": 518648,

	"tabularConnectionTimeoutMs": 18000000,

	"commandType": 2,
	"errorCount": 2,
	"refreshParallelism": 4,
	"vertipaqTotalRows": 933137,
	"intendedUsage": 2
}

So is it even going to be possible to refresh this model on an F16 capacity given that refreshing it requires 5.8GB? Well the most memory efficient way of refreshing a semantic model without changing it is to clear all the data out from it first, then refresh just the data in each partition in each table one at a time, then do a recalc, all in separate transactions – which is exactly what the partialBatch commit mode of the Enhanced Refresh API (which I blogged about recently here) does if you set its maxParallelism property to 1. So I tried this – and it failed again ☹️. Unsurprisingly the first refresh operation of type clearValues ran successfully and peaked at 0.01GB of memory, after which the size of the model in memory would have been negligible. Then the refresh of type dataOnly for the first partition in the table was successful but the refresh operation peaked at 3.4GB. The second partition refresh of type dataOnly then failed; I assume it would have also peaked at 3.4GB but the error message told me that the model was already 2.4GB in size when the refresh started, so the memory limit for this refresh was 2.6GB.
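For reference, here’s a minimal sketch of how you might request a refresh like this through the Enhanced Refresh API from Python, using the documented commitMode and maxParallelism properties – the IDs and access token are placeholders:

import requests
# placeholders - replace with your own workspace (group) and semantic model IDs
workspace_id = "00000000-0000-0000-0000-000000000000"
dataset_id = "00000000-0000-0000-0000-000000000000"
access_token = "<access token acquired via MSAL or similar>"
url = f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes"
# partialBatch commits each stage separately and maxParallelism=1 refreshes
# one partition at a time, minimising peak memory usage
body = {"type": "full", "commitMode": "partialBatch", "maxParallelism": 1}
response = requests.post(url, json=body, headers={"Authorization": f"Bearer {access_token}"})
response.raise_for_status()  # 202 Accepted - the refresh runs asynchronously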

If you’re hitting the Command Memory Limit and you don’t want to increase the size of your capacity but you are willing to make changes to your model, then there could be several ways to reduce the amount of memory used during a refresh – but in order to understand which method will work you will need to know what is using memory during the refresh, and that isn’t easy. Using the partialBatch commit option with the Enhanced Refresh API does at least split the refresh out into its constituent parts and makes it easier to see at which stage the refresh fails. In this case refresh failed at the dataOnly stage so creating more, smaller partitions could help. Following the steps for reducing model size given in blog posts like this one by Nikola Ilic will also help a lot to reduce memory consumption during a refresh. Other common culprits for high memory consumption during a refresh include calculated columns and tables, or Power Query queries that buffer large amounts of data in memory (for example because of transformations that sort or aggregate data and which do not fold). Therefore following Roche’s maxim and doing all your transformations and creating the equivalent of your calculated columns and tables in your data source, before the data is loaded into Power BI, will help. If you do have to use calculated columns and tables then you should look at tuning your DAX expressions so they use less memory. I’ll look at some examples of how to tune your Power Query queries or DAX expressions to reduce memory in future posts.