If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not – where “staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output to be written to. Turning this on or off can have a big impact on your refresh times, making them a lot faster or a lot slower. You can find this property by right-clicking on the query name in the Queries pane:
At the moment this property is on by default for every query although this may change in the future. But should you turn it on for the queries in your Gen2 dataflows? It depends, and you should test to see what gives you the best performance.
Let’s see a simple example. I uploaded a CSV file from my favourite data source, the Land Registry price paid data, with about a million rows in it to the files section of a Lakehouse, then created a query that did a group by on one of the columns to find the number of property transactions by each county in England and Wales. The query was set to load its output to a table in a Warehouse.
Here’s the diagram view for this query:
I then made sure that staging was turned off for this query:
This means that the Power Query engine did the group by itself as it read the data from the file.
Looking at the refresh history for this dataflow:
…showed that the query took between 18-24 seconds to run. Clicking on an individual refresh to see the details:
…showed a single activity to load the output to the Warehouse. Clicking on this activity to see more details:
…shows how long it took – 15 seconds – plus how many rows were loaded to the destination Warehouse and how much data.
I then created a second dataflow to see the effect of staging. It’s important to understand that copying the previous dataflow and enabling staging on the only query in it does not do what I wanted here: I had to create two queries, one with staging enabled (called PP here) and no destination set to stage all the raw data from the CSV file, and a second one (called Counties here) that references the first with staging disabled and its destination set to the Warehouse I used in the previous dataflow to do the group by.
Here’s the diagram view for these two queries:
Note the blue outline on the PP query which indicates that it’s staged and the grey outline on the Counties query that indicates that it is not staged.
Looking at the Refresh History for this dataflow showed that it took around 40 seconds to run on average:
Looking at the first level of detail for the last refresh showed the extra activity for staging the data:
Clicking on the details for this staging activity for the PP table showed that it took 17 seconds to load all the raw data:
The activity to write the data to the Warehouse took about the same as with the first dataflow:
In summary, the first dataflow clearly performs better than the second dataflow. In this case, therefore, it looks like the overhead of staging the data made the performance worse.
Don’t take this simple example to prove a general rule: every dataflow will be different and there are a lot of performance optimisations planned for Dataflows Gen2 over the next few months, so you should test the impact of staging for yourself. I can imagine for different data sources (a Lakehouse source is likely to perform very well, even for files) and different transformations then staging will have a positive impact. On the other hand if you’re struggling with Dataflows Gen2 performance, especially at the time of writing this post, turning off staging could lead to a performance improvement.
In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.
Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:
It filters the Log Analytics data down to get events from the last day and just the ProgressReportCurrent events, as well as the ProgressReportBegin/End events which are fired before and after the ProgressReportCurrent events.
It then splits the data into groups of rows (‘partitions’ in KQL, but of course not the partitions that are being refreshed) by a combination of XmlaRequestId (ie the refresh operation) and XmlaObjectPath (ie the partition that is being refreshed)
For each group of rows it will then:
Find the ProgressReportBegin event and from this get the time when data started to be read from the source
Get all subsequent ProgressReportCurrent events and calculate the amount of time elapsed since the previous event (which might be the ProgressReportBegin event or a previous ProgressReportCurrent event) and the number of rows read
When the ProgressReportEnd event is encountered, calculate the amount of time elapsed since the previous ProgressReportCurrent event and the number of rows (which will be less than 10000) read since then
Filter out the ProgressReportBegin events because we don’t need them any more
Finally, add columns that splits out the table name and partition name and calculates the number of rows read per second for each row by dividing the number of rows read for each event by the amount of time elapsed since the previous event
What can this query tell us about throughput?
First of all, something interesting but not necessarily useful. At least for the data source I’m using for my tests, when I plot a column chart with the number of rows read on the x axis and the amount of time elapsed since the last event on the y axis (ie the amount of time it takes to read 10000 rows for all but the last column) then I noticed that every 200000 rows something happens to slow down the read:
I have no idea what this is, whether it’s a quirk of this particular source or connector, but it’s a great example of the kind of patterns that become obvious when you visualise data rather than look at a table of numbers.
Plotting time on the x axis of a line chart and the cumulative total of rows read on the y axis gives you something more useful. Here’s the chart for one of the refreshes mentioned in my last post where four partitions of the same table are refreshed in parallel:
In this case throughput is fine up until the end of the refresh at which point something happens to the February, March and April partitions but not the January partition to slow them down for about 30 seconds, after which throughput goes back to what it was before. Here’s the same chart zoomed in a bit:
Here’s the same problem shown in the first graph above, where the number of rows read is on the x axis, showing how for example with the April partition there’s a sudden spike where it takes 14 seconds to read 10000 rows rather than around 0.3 seconds:
What is this, and why isn’t the January partition affected? Maybe it was a network issue or caused by something happening in the source database? Looking at another refresh that also refreshes the same four partitions in parallel, it doesn’t seem like the same thing happens – although if you look closely at the middle of the refresh there might be a less pronounced flattening off:
Again, the point of all this is not the mysterious blips I’ve found in my data but the fact that if you take the same query and look at your refreshes, you may find something different, something more significant and something you can explain and do something about.
In the first post in this series I described the events in Log Analytics that can be used to understand throughput – the speed that Power BI can read from your dataset when importing data from it – during refresh. While the individual events are easy to understand when you look at a simple example they don’t make it easy to analyse the data in the real world, so here’s a KQL query that takes all the data from all these events and gives you one row per partition per refresh:
//Headline stats for partition refresh with one row for each partition and refresh
//Get all the data needed for this query and buffer it in memory
let RowsForStats =
materialize(
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData"
or (OperationDetailName == "TabularRefresh" and (EventText contains "partition"))
);
//Get just the events for the initial SQL execution phase
let ExecuteSql =
RowsForStats
| where OperationDetailName == "ExecuteSql"
| project XmlaRequestId, XmlaObjectPath,
ExecuteSqlStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
ExecuteSqlEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
ExecuteSqlDurationMs = DurationMs, ExecuteSqlCpuTimeMs = CpuTimeMs;
//Get just the events for the data read and calculate rows read per second
let ReadData =
RowsForStats
| where OperationDetailName == "ReadData"
| project XmlaRequestId, XmlaObjectPath,
ReadDataStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
ReadDataEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
ReadDataDurationMs = DurationMs, ReadDataCpuTime = CpuTimeMs,
TotalRowsRead = ProgressCounter, RowsPerSecond = ProgressCounter /(toreal(DurationMs)/1000);
//Get the events for the overall partition refresh
let TabularRefresh =
RowsForStats
| where OperationDetailName == "TabularRefresh"
| parse EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, XmlaObjectPath,
TabularRefreshStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ),
TabularRefreshEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ),
TabularRefreshDurationMs = DurationMs, TabularRefreshCpuTime = CpuTimeMs,
MashupCPUTimeMs, MashupPeakMemoryKB;
//Do an inner join on the three tables so there is one row per partition per refresh
ExecuteSql
| join kind=inner ReadData on XmlaRequestId, XmlaObjectPath
| join kind=inner TabularRefresh on XmlaRequestId, XmlaObjectPath
| project-away XmlaRequestId1, XmlaRequestId2, XmlaObjectPath1, XmlaObjectPath2
| extend Table = tostring(split(XmlaObjectPath,".", 2)[0]), Partition = tostring(split(XmlaObjectPath,".", 3)[0])
| project-reorder XmlaRequestId, Table, Partition
| order by XmlaRequestId, ExecuteSqlStartTime desc
It’s a bit of a monster query but what it does is quite simple:
First it gets all the events relating to partition refresh in the past 1 day (which of course you can change) and materialises the results.
Then it filters this materialised result and gets three sets of tables:
All the ExecuteSql events, which tell you how long the data source took to start returning data and how much CPU time was used.
All the ReadData events, which tell you how long Power BI took to read all the rows from the source after the data started to be returned, how much CPU time was used, and how many rows were read. Dividing duration by rows read lets you calculate the number of rows read per second during this phase.
All the TabularRefresh events, which give you overall data on how long the partition refresh took, how much CPU time was used, plus information on Power Query peak memory usage and CPU usage.
What can this tell us about refresh throughput though? Let’s use it to answer some questions we might have about throughput.
What is the impact of parallelism on throughput? I created a dataset on top of the NYC taxi data Trip table with a single table, and in that table created four partitions containing data for January, February, March and April 2013, each of which contained 13-15 million rows. I won’t mention the type of data source I used because I think it distracts from what I want to talk about here, which is the methodology rather than the performance characteristics of a particular source.
I then ran two refreshes of these four partitions: one which refreshed them all in parallel and one which refreshed them sequentially, using custom TSML refresh commands and the maxParallelism property as described here. I did a refresh of type dataOnly, rather than a full refresh, in the hope that it would reduce the number of things happening in the Vertipaq engine during refresh that might skew my results. Next, I used the query above as the source for a table in Power BI (for details on how to use Log Analytics as a source for Power BI see this post; I found it more convenient to import data rather than use DirectQuery mode though) to visualise the results.
Comparing the amount of time taken for the SQL query used to start to return data (the ExecuteSqlDurationMs column from the query above) for the four partitions for the two refreshes showed the following:
The times for the four partitions vary a lot for the sequential refresh but are very similar for the parallel refresh; the January partition, which was refreshed first, is slower in both cases. The behaviour I described here regarding the first partition refreshed in a batch could be relevant.
Moving on to the Read Data phase, looking at the number of rows read per second (the RowsPerSecond column from the query above) shows a similar pattern:
There’s a lot more variation in the sequential refresh. Also, as you would expect, the number of rows read per second is much higher when partitions are refreshed sequentially compared to when they are refreshed in parallel.
Looking at the third main metric, the overall amount of time taken to refresh each partition (the TabularRefreshDurationMs column from the query above) again shows no surprises:
Each individual partition refreshes a lot faster in the sequential refresh – almost twice as fast – compared to the parallel refresh. Since four partitions are being refreshed in parallel during the second refresh, though, this means that any loss of throughput for an individual partition as a result of refreshing in parallel is more than compensated for by the parallelism, making the parallel refresh faster overall. This can be shown using by plotting the TabularRefreshStartTime and TabularRefreshEndTime columns from the query above on a timeline chart (in this case the Craydec Timelines custom visual) for each refresh and each partition:
On the left of the timeline you can see the first refresh where the partitions are refreshed sequentially, and how the overall duration is just over 20 minutes; on the right you can see the second refresh where the partitions are refreshed in parallel, which takes just under 10 minutes. Remember also that this is just looking at the partition refresh times, not the overall time taken for the refresh operation for all partitions, and it’s only a refresh of type dataOnly rather than a full refresh.
So does this mean more parallelism is better? That’s not what I’ve been trying to say here: more parallelism is better for overall throughput in this test but if you keep on increasing the amount of parallelism you’re likely to reach a point where it makes throughput and performance worse. The message is that you need to test to see what the optimal level of parallelism – or any other factor you can control – is for achieving maximum throughput during refresh.
These tests only show throughput at the level of the ReadData event for a single partition, but as mentioned in my previous post there is even more detailed data available with the ProgressReportCurrent event. In my next post I’ll take a closer look at that data.
[Thanks to Akshai Mirchandani for providing some of the information in this post, and hat-tip to my colleague Phil Seamark who has already done some amazing work in this area]
If you’re tuning refresh for a Power BI Import mode dataset one of the areas you’ll be most interested in is throughput, that is to say how quickly Power BI can read data from the data source. It can be affected by a number of different factors: how quickly the data source can return data; network latency; the efficiency of the connector you’re using; any transformations in Power Query; the number of columns in the data and their data types; the amount of other objects in the same Power BI dataset being refreshed in parallel; and so on. How do you know if any or all of these factors is a problem for you? It’s a subject that has always interested me and now that Log Analytics for Power BI datasets is GA we have a powerful tool to analyse the data, so I thought I’d do some testing and write up my findings in a series of blog posts.
The first thing to understand is what events there are in Log Analytics that provide data on throughput. The events to look at are ProgressReportBegin, ProgressReportCurrent and ProgressReportEnd (found in the OperationName column), specifically those with OperationDetailName ExecuteSql, ReadData and Tabular Refresh. Consider the following KQL query that looks for at this data for a single refresh and for a single partition:
let RefreshId = "e5edc0de-f223-4c78-8e2d-01f24b13ccdc";
let PartitionObjectPath = "28fc7514-d202-4969-922a-ec86f98a7ea2.Model.TIME.TIME-ffca8cb8-1570-4f62-8f04-993c1d1d17cb";
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(3d)
| where XmlaRequestId == RefreshId
| where XmlaObjectPath == PartitionObjectPath
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" or OperationDetailName == "TabularRefresh"
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0],
TimeGenerated, OperationName, OperationDetailName, EventText, DurationMs, CpuTimeMs, ProgressCounter
| order by XmlaObjectPath, TimeGenerated asc;
Some notes on what this query does:
All the data comes from the PowerBIDatasetsWorkspace table in Log Analytics
I’ve put an arbitrary filter on the query to only look for data in the last three days, which I know contains the data for the specific refresh I’m interested in
An individual refresh operation can be identified by the value in the XmlaRequestId column and I’m filtering by that
An individual partition being refreshed can be identified by the value in the XmlaObjectPath column and I’m filtering by that too
The value in the XmlaObjectPath column can be parsed to obtain both the table name and partition name
The query filters the events down to those mentioned above
Here are some of the columns from the output of this query, showing data for the refresh of a single table called TIME with a single partition:
Some more notes on what we can see here:
The TimeGenerated column gives the time of the event. Although the time format shown here only shows seconds, it actually contains the time of the event going down to the millisecond level – unlike in a Profiler trace, where time values are rounded to the nearest second and are therefore lot less useful.
The first event returned by this query is a ProgressReportBegin event of type TabularRefresh which marks the beginning of the partition refresh.
As I blogged here, after that are a pair of ProgressReportBegin/End events of type ExecuteSql. The value in the DurationMs column tells you how long it takes for the data source (which includes time taken by the actual query generated against the data source and any Power Query transformations) to start to return data – which was, in this case, 6016 milliseconds or 6 seconds.
Next there is a ProgressReportBegin event which indicates the beginning of data being read from the source.
After that there are a series of ProgressReportCurrent events which mark the reading of chunks of 10000 rows from the source. The ProgressCounter column shows the cumulative number of rows read.
Next there is a ProgressReportEnd event that marks the end of the data read. The ProgressCounter shows the total number of rows read (which must be less than 10000 rows greater than the value in the ProgressCounter column for the previous ProgressReportCurrent event); the DurationMs column shows the total time taken to read the data. In this case 86430 rows of data were read in 1.4 seconds.
Finally there is a ProgressReportEnd event of type TabularRefresh, the pair of the first event shown returned. It not only shows the total amount of time taken and the CPU time used to refresh the partition (which includes other operations that are nothing to do with throughput such as compressing data and building dictionaries) in the DurationMs column, as I blogged here it also includes data on the CPU and peak memory used by Power Query during the refresh. In this case the total refresh time was 7.6 seconds, so not much more than the time taken to read the data.
Clearly there’s a lot of useful, interesting data here, but how can we analyse it or visualise it? Stay tuned for my next post…
If you’re excited about Direct Lake mode in Fabric you’re probably going to want to test it with some of your own data, and in particular look at DAX query performance. Before you do so, though, there are a few things to know about performance testing with Direct Lake datasets that are slightly different from what you might be used to with Import mode or DirectQuery datasets.
Dataset “hotness”
In my last post I talked about how, in Direct Lake datasets, Power BI can page individual column segments, dictionaries and join indexes into memory on demand when a DAX query is run and how those artefacts may get paged out later on. It therefore follows that there are four possible states or levels of “hotness” that a dataset can be in when a DAX query is run and that each of these states will have different performance characteristics:
The column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in before the query can run.
Some of the column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in, while some of them are already in memory.
All of the column segments, dictionaries and join indexes needed by the query are already held in memory.
All of the column segments, dictionaries and join indexes needed by the query are already held in memory and, as a result of previous query activity, Vertipaq engine caches useful for the query are also already populated.
State (1) is the “coldest” state and will give the worst possible query performance while state (4) is the “hottest” state and will give the best possible query performance.
When you’re testing the performance of a DAX query on a Direct Lake dataset you should test it on a dataset that is in state (1), state (3) and state (4) so you get a good idea of how much time is taken to page data into memory and how much of a performance improvement Vertipaq engine caching brings.
Ensuring everything is paged out
To test query performance in state (1) you need a way to ensure that all column segments, dictionaries and join indexes are paged out of memory. At the time of writing this post you can ensure this simply by refreshing the dataset. This will change sometime in the next few months though, because paging everything out of memory when you refresh is not ideal behaviour in the real world, so there will be another way to ensure everything is paged out. I’ll update this post when that change happens. You can ensure that all column segments and dictionaries have been paged out by running the two DMV queries mentioned in my previous post: DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS and DISCOVER_STORAGE_TABLE_COLUMNS.
Why you should use DAX Studio for performance testing
I also recommend using DAX Studio to run queries when performance testing, for a number of reasons. First of all it makes it easy to clear the Vertipaq engine cache before a query is run with the “Clear Cache” and “Clear on Run” (which automatically clears the cache before each query) buttons. This not only runs a Clear Cache command, to clear the Vertipaq cache, it also runs a (hidden) DAX query that does not query any data from the dataset but which does trigger the creation of all the measures on the dataset. In most cases this is very fast, but if you have thousands of measures it could take a few seconds (similar to what I show here for report-level measures). If you are not using DAX Studio you can achieve the same result by running a query like:
EVALUATE {1}
DAX Studio also lets you run the same query multiple times using its “Run benchmark” feature (although this only lets you test for states (3) and (4) at the time of writing) and its “Server Timings” feature is invaluable for understanding what’s going on inside the Vertipaq engine when a query runs.
Also make sure you are running the very latest version of DAX Studio (which is 3.0.8 at the time of writing) to make sure it works properly with Fabric.
Performance testing methodology
So, putting this all together, in order to run a single performance test on a Direct Lake dataset to capture performance for states (1), (3) and (4):
Preparation
Create a custom Power BI dataset in Fabric (not everything here works with a default dataset at the time of writing)
Open your report in Power BI Desktop connected to your published Direct Lake dataset
Capture the DAX queries generated by the visuals you want to test using Performance Analyzer by clicking Copy Query (see here for how to do this)
Install the very latest version of DAX Studio and open it
Connect DAX Studio to your workspace’s XMLA Endpoint (see here for how to do this)
Paste your DAX query into DAX Studio
Turn on DAX Studio’s Server Timings option
Ensure the “Clear on Run” option in the ribbon is turned off
Click the “Clear Cache” button on the ribbon in DAX Studio
Run the DAX query
Save the output of the Server Timings pane in DAX Studio by clicking the Export button
To test performance for state (3), immediately after the previous steps:
Click the “Clear Cache” button on the ribbon in DAX Studio
Run the DAX query
Save the output of the Server Timings pane in DAX Studio
To test the performance for state (4), immediately after the previous steps:
Run the DAX query again without clicking the “Clear Cache” button
Save the output of the Server Timings pane
Fallback to DirectQuery
Even with a Direct Lake dataset there is no guarantee that your query will be answered in Direct Lake mode: in as-yet not fully documented scenarios (but basically if your data volumes are too large for the capacity you’re using) Power BI will switch to using DirectQuery mode against the Lakehouse’s SQL Endpoint. One of the objectives of your performance testing should be to make sure that this happens as infrequently as possible because DirectQuery mode will perform noticeably worse than Direct Lake mode.
You may notice some DirectQuery events even when the query itself only uses Direct Lake; these are used to get metadata or security information from the Lakehouse and can be ignored. Here’s an example of this:
Load testing
Testing for a single user is important, but don’t forget about testing performance with multiple concurrent users. As I discuss here, realistic load tests are the only way to get a good idea of how your report will actually perform in production and you can load test a Direct Lake dataset in exactly the same way as an Import mode or DirectQuery dataset.
Direct Lake is still in preview!
The last point to make is that like the rest of Fabric, Direct Lake is still in preview. This not only means that functionality is missing, it also means that performance is not yet as good as it will be yet. So, by all means test Direct Lake and tell us how fast (or not) it is, but be aware that your test results will be out of date very quickly as the engine evolves.
[Thanks to Krystian Sakowski for the information in this post]
This error only used to appear in the Power BI Service, but the good news is – and trust me, this is good news – it may now appear in Power BI Desktop too following the May 2023 release.
First, a quick recap of what causes this error. The Power BI Service has finite resources so we at Microsoft don’t allow you to run queries there that take hours to complete or consume vast amounts of memory; we impose limits on query duration and memory usage, and you’ll see this error in the Power BI Service when you hit these limits. The problem has been that, up until recently, these limits were not imposed in Power BI Desktop so it was easy to develop inefficient reports and datasets that worked ok (if a little slowly) on the developer’s PC but then caused errors after they were published. What has changed is that these limits are now imposed in Power BI and they are also configurable there.
How do you know if you are running into these limits? You’ll see an error in your visual in Power BI Desktop that looks like this:
The error message is:
Visual has exceeded the available resources
If you click on the “See details” link you’ll see the following dialog:
Resources Exceeded This visual has exceeded the available resources. Try filtering to decrease the amount of data displayed.
What should you do if you encounter this error? The most important thing is to understand why it’s happening. There are two possible causes:
Your query is taking too long to run. For the Power BI Service, the default query timeout is 225 seconds although it is possible for an admin to reduce this; unless you’ve set a custom limit or you’re not using a Power BI dataset as a source, it’s likely that 225 seconds is the longest that a query will run for in Power BI Desktop.
Your query is using too much memory. This is probably because you are doing an inefficient DAX calculation on a very large table (filtering on entire table rather than a single column is a classic anti-pattern, for example).
As a result you need to do some tuning. “But Chris!”, you say, “my query has to run for longer than 225 seconds! It’s too difficult to tune!” To which I say – tough, you don’t have a choice. Slow, inefficient queries result in unhappy users so if you don’t fix these issues you have an even bigger problem. This video of Marco’s is a good place to start if you want to learn about performance tuning.
In order to do that tuning though (or if you just want to get stuff done before you do any tuning, or if you think the wrong limits are being imposed) you’ll need to turn off the limits so you can capture the DAX query for the offending visual using Performance Analyzer. To do this, go to File/Options and settings/Options to open the Options dialog, go to the Report settings pane and scroll to the bottom to see the Query limit simulations section:
You can either use the “Custom limits” option, as shown in the screenshot above, to set your own limits (enter 0 in one of these boxes for no limits to be imposed) or use the “No query limits” for no limits. You should only use these options temporarily, remember, otherwise you run the risk of getting this error in the Power BI Service later on!
It’s worth mentioning that the limits imposed in Power BI Desktop are only imposed on queries generated by Power BI Desktop itself, which means that they won’t affect external tools like DAX Studio that can also be used to query a dataset in Power BI Desktop. You can see how the limits are imposed by running a Profiler trace on Power BI Desktop, finding the Query Begin event for each query and looking for the Timeout and DbpropMsmdRequestMemoryLimit properties in the Property List shown under the query text:
Also, these settings are saved on a per-file basis, so if you create a new .pbix file it will have the default settings and not the settings you made in any other .pbix file.
Here’s a recording of a session I did for the Manchester (UK) Power BI user group recently on best practices for DirectQuery mode in Power BI:
I’ve done it for a few other groups over the last six months but this is the latest and best version, I think.
I’ve worked with several customers using DirectQuery mode since I joined the CAT team and learned a lot along the way. Some of this knowledge has been written up by me (eg this post on Snowflake DirectQuery mode best practices), some of it by the people who work with the data sources Power BI runs on top of (see posts by Dany Hoter on the Azure Data Explorer blog, for example; there’s also going to be a lot of new material on DirectQuery for Databricks coming soon, with this post as a start). There’s a lot of detailed information in the docs too, for example here, here and here.
But remember folks: the most important piece of advice around DirectQuery is to not use it unless you’re really, really, really sure you have no other option. It’s possible to make it work well but it takes a lot more tuning and specialist skill than Import mode!
Another new metric has appeared in Profiler/Log Analytics recently, added to the end of the query text shown in the TextData column for the Query End event. It’s called WaitTime:
What does WaitTime represent? Here’s the technical explanation: it’s the wait time on the query thread pool in the Analysis Services engine before the query starts to run. But what does this mean for you as someone trying to tune DAX queries in Power BI?
As I explained recently in my post on load testing in Power BI, when you publish your dataset to the Power BI Service it runs on an instance of a version of the Analysis Services engine on a node somewhere inside the infrastructure that we at Microsoft manage for you. You don’t see any of this, but there’s still a machine with a limited amount of CPU and memory available and it can get overloaded if there are too many expensive queries running, or too many datasets being refreshed at the same time, or even one large dataset being refreshed; part of the magic of the Power BI Service is how we move datasets around to ensure this happens as infrequently as possible. If, however, you’re very unlucky you may run a query on a dataset that’s running on an overloaded machine and the performance of that query may be affected because the CPU is too busy. In that case you may see a value for WaitTime that is greater than zero and this means that your query was slower than it might otherwise have been.
This can happen for datasets in both Import mode and DirectQuery mode. It can happen because of something that’s happening on the dataset you’re querying – such as a load test or a full refresh – or it can happen because someone else is doing something similar with a dataset that happens to be on the same node as yours and the Power BI Service hasn’t managed to move your dataset to a quieter node yet. It is not the same thing as throttling in Premium (or “interactive request delays” as it’s officially called) but if you have a capacity that’s being throttled because of something you’re doing with a specific dataset then queries on that dataset may also have a WaitTime greater than zero.
I haven’t seen a non-zero value for WaitTime yet, but I only found out about it yesterday so I’m looking forward to seeing whether I see one next time I’m doing some performance tuning. If you see one let me know in the comments.
[Thanks once again to Akshai Mirchandani for this information]
This post is a follow-up to my recent post on identifying CPU and memory-intensive Power Query queries in Power BI. In that post I pointed out that Profiler and Log Analytics now gives you information on the CPU and memory used by an individual Power Query query when importing data into Power BI. What I didn’t notice when I wrote that post is that there is also now information available in Profiler and Log Analytics that tells you about peak memory and CPU usage across all Power Query queries for a single refresh in the Power BI Service, as well as memory usage for the refresh as a whole.
Using the same dataset from my previous post, I ran a Profiler trace on the workspace and captured the Command Begin and Command End events while I refreshed the dataset. Here’s what Profiler shows for the Command End event:
In the upper pane, the Duration column tells you how long the refresh took in milliseconds and the CPUTime column tells you how much CPU was used by both the Analysis Services engine and the Power Query engine during the refresh. This is not new, and I wrote about the CPUTime column last year here.
In the lower pane where the TMSL for the refresh operation is shown – this is the text from the TextData column – underneath the TMSL there are three new values shown:
PeakMemory shows the maximum amount of memory used during the refresh operation. This includes memory used by the Analysis Services engine and the Power Query engine.
MashupCPUTime shows the total amount of CPU used by the Power Query engine for all Power Query queries used to import data into the dataset. This value will always be less than the value shown in the CPUTime column in the upper pane.
MashupPeakMemory shows the maximum amount of memory used during the refresh operation by just the Power Query engine across all query evaluations. It’s important to note that this value may not be entirely accurate since the memory usage values, and therefore the peaks, are captured asynchronously so there could be higher peaks that are not detected.
This new information should be extremely useful for troubleshooting refresh failures that are caused by excessive memory consumption.
[Thanks to Akshai Mirchandani for this information]
If you’re about to put a big Power BI project into production, have you planned to do any load testing? If not, stop whatever you’re doing and read this!
In my job on the Power BI CAT team at Microsoft it’s common for us to be called in to deal with poorly performing reports. Often, these performance problems only become apparent after the reports have gone live: performance was fine when it was just one developer testing it, but as soon as multiple end-users start running reports they complain about slowness and timeouts. At this point it’s much harder to change or fix anything because time is short, everyone’s panicking and blaming each other, and the users are trying to do their jobs with a solution that isn’t fit for purpose. Of course if the developers had done some load testing then these problems would have been picked up much earlier.
What is the cause of the problem? As the saying goes, the cloud is just somebody else’s computer and when you publish your datasets to the Power BI Service they run (at least at the time of writing) on a single physical node – which of course has a finite amount of memory and CPU. If you have too many users trying to run reports at the same time then that memory and CPU will be exhausted and performance will suffer. This is a problem that the new dataset scale-out feature for Premium, currently in preview, will partially solve by creating multiple replicas of your dataset that run on separate physical nodes in the Power BI Service. I say “partially solve” because there is another constraint: the amount of memory and CPU that is available for use in Shared capacity (commonly called Pro) or with Premium Per User, neither of which are fully documented, or that you have bought with your Premium capacity. If you try to use more memory than you’re allowed you’ll get errors; if you try to use more CPU than is available you’ll be throttled. It’s worth taking some time reading the Premium documentation to understand how limits on CPU usage are calculated and enforced.
As a result of all this it’s important to check that normal production usage of your dataset does not result in any of these limits being hit. These limits are very generous and in the vast majority of cases your reports will perform well without you doing anything, but if you’re working with large data volumes, complex calculations, reports with many visuals and/or thousands of users you can’t assume that everything will be ok. This applies whether you’re building an internal BI solution or using Power BI Embedded to do B2B reporting; similar questions are raised when you’re trying to work out how much Premium capacity you should buy to support a large number of small self-service solutions, or how much Premium capacity is needed when you’re migrating from Azure Analysis Services. If you’re using DirectQuery mode then you might be surprised at how much CPU still gets used on your Premium capacity, and of course you have the additional task of ensuring your data source can handle all the queries Power BI needs to run on it; if you’re connecting to your data source via an on-premises data gateway then you need to ensure the machine your gateway is running on is properly specified too.
How can you do load testing with Power BI? There are a few options for automated load testing, but the most widely used is the Realistic Load Test Tool (see also my colleague Sergei Gundorov’s simplified version here) and Kristyna Hughes has a great post on how to use it here. If automated testing sounds too complex then you can always round up ten or so people and get them to run through a series of interactions on the reports; this won’t give you results you can compare reliably from test to test, but it will tell you whether you have a problem or not. While running a load test you should measure report rendering performance on the front-end – the Realistic Load Testing Tool will capture this information for you – and if you’re using Premium you should also capture information on the back-end either by running a Profiler trace or enabling Log Analytics on your workspace, so you can get accurate DAX query timings. The Premium Capacity Metrics App will tell you how close you’re coming to the limits of your Premium capacity.
One mistake that people often make when load testing Power BI is over-estimating how many concurrent users they’ll have. Even with an enterprise BI solution it’s relatively rare to have more than ten people running queries at exactly the same time. The most concurrent users I’ve ever seen is at a large retail customer where, every morning at 8:30am, staff in every store had a meeting where they all looked at the same Power BI report and even then there were no more than 2-300 concurrent users.
What can you do if your load testing shows you have a problem? The first thing is to do some tuning – and remember, it’s more important to tune your queries to reduce CPU Time than to make them faster. Look at the way your data is modelled, make your DAX calculations more efficient, reduce the number of visuals in your report and so on. Once you’ve done that, if you’re using Shared (Pro) then you should consider buying PPU or a Premium capacity. If you’re already using Premium you should consider testing the scale-out feature mentioned above and either moving up to a larger SKU or enabling autoscale.