Measuring Memory And CPU Usage In Power BI During Dataset Refresh

This post is a follow-up to my recent post on identifying CPU and memory-intensive Power Query queries in Power BI. In that post I pointed out that Profiler and Log Analytics now gives you information on the CPU and memory used by an individual Power Query query when importing data into Power BI. What I didn’t notice when I wrote that post is that there is also now information available in Profiler and Log Analytics that tells you about peak memory and CPU usage across all Power Query queries for a single refresh in the Power BI Service, as well as memory usage for the refresh as a whole.

Using the same dataset from my previous post, I ran a Profiler trace on the workspace and captured the Command Begin and Command End events while I refreshed the dataset. Here’s what Profiler shows for the Command End event:

In the upper pane, the Duration column tells you how long the refresh took in milliseconds and the CPUTime column tells you how much CPU was used by both the Analysis Services engine and the Power Query engine during the refresh. This is not new, and I wrote about the CPUTime column last year here.

In the lower pane where the TMSL for the refresh operation is shown – this is the text from the TextData column – underneath the TMSL there are three new values shown:

  • PeakMemory shows the maximum amount of memory used during the refresh operation. This includes memory used by the Analysis Services engine and the Power Query engine.
  • MashupCPUTime shows the total amount of CPU used by the Power Query engine for all Power Query queries used to import data into the dataset. This value will always be less than the value shown in the CPUTime column in the upper pane.
  • MashupPeakMemory shows the maximum amount of memory used during the refresh operation by just the Power Query engine across all query evaluations. It’s important to note that this value may not be entirely accurate since the memory usage values, and therefore the peaks, are captured asynchronously so there could be higher peaks that are not detected.

This new information should be extremely useful for troubleshooting refresh failures that are caused by excessive memory consumption.

[Thanks to Akshai Mirchandani for this information]

Identifying CPU And Memory-Intensive Power Query Queries During Refresh In The Power BI Service

Last year I wrote a post about a change in the Power BI Service that meant the CPU Time associated with Power Query transformations was added to the total shown for a dataset refresh operation in Profiler traces and Log Analytics:

https://blog.crossjoin.co.uk/2022/07/03/measuring-power-query-cpu-usage-during-power-bi-dataset-refresh/

This was useful, but it didn’t tell you directly how much CPU Time was used by Power Query and it didn’t tell you which tables or partitions in a refresh were using the most CPU. It also didn’t tell you anything about Power Query memory usage. The good news that recently there has been another change that solves these problems.

Let’s say you have a Power BI dataset that consists of a single table whose source is the following Power Query query:

let
  Source = #table(type table [MyNumber = number], List.Transform({1 .. 1000000}, each {_})), 
  #"Added Custom" = Table.AddColumn(
    Source, 
    "ARandomNumber", 
    each Number.RandomBetween(0, 10000), 
    type number
  ), 
  #"Sorted Rows" = Table.Sort(#"Added Custom", {{"ARandomNumber", Order.Ascending}})
in
  #"Sorted Rows"

This query creates a table with a million rows, adds a column with random numbers in and then sorts on that column – which is, as you’d expect, a very CPU and memory-hungry operation.

If you refresh this dataset in the Power BI Service and run a Profiler trace on it, looking at the Command Begin/End and Progress Report Begin/End events, this is what you’ll see:

The final Command End event shows the toal duration of the refresh as well as the amount of CPU used by both the Analysis Services engine and Power Query – in this case 24094ms.

If you look at the Progress Report End event associated with the finish of the refresh for the only partition of the only table in the dataset (highlighted in the screenshot above), there’s some extra information:

It shows the amount of CPU Time and the maximum amount of memory used by the Power Query engine while refreshing this partition. In this case the Power Query engine used 19468ms of CPU and reached a peak of 581848KB of memory. I can tell this is going to be really useful for troubleshooting refresh performance issues and out-of-memory errors.

[Thanks to Akshai Mirchandani, Xiaodong Zhang, Ting-Wei Chang and Jast Lu for this information]

The ExtAuth Trace Event In Power BI

When you’re looking at the Log Analytics data for your Power BI Premium dataset, or studying a Profiler trace, you may see a command called ExtAuth in the OperationDetailName column:

It is documented here as “An internal service-generated command to perform authentication” – and as such it’s not something you have any control over. It may be triggered when a report is opened or a dataset refresh is started.

However you may wonder why it’s so slow sometimes. This is because datasets in the Power BI Service are not held in memory 100% of the time (they are sometimes paged out to disk) although they do need to be in memory before they can be queried or refreshed. When a dataset is not held in memory but the Power BI Service needs to connect to it, the dataset is loaded into the Analysis Services engine and into memory and this can take some time, especially if the dataset is large. The ExtAuth event is often the event that triggers the load of a dataset into memory and therefore the time required to load the dataset into memory is associated with it.

[Thanks to Akshai Mirchandani, Kay Unkroth and Rui Romano for the information in this post]

Monitor Power BI Queries And Refreshes With DirectQuery On Log Analytics, Part 4: Query Sessions

In my last post I showed how to use Log Analytics data to analyse Power BI query activity. The problem with looking at a long list of queries, though, is that it can be overwhelming and it can be hard to get a sense of when users were and weren’t actively interacting with a report. In this post I’ll show you how you can write a KQL query that gives you a summary view that solves this problem by grouping queries into sessions.

What is a session? I’m going to define it as a group of DAX queries – sessions – run by the same user that occur within 90 seconds of each other. If a user opens a Power BI report, clicks on a slicer or filter, changes a page or whatever, then so long as each of the DAX queries that are generated in the background when they do this have end times that are no more than 90 seconds apart, then that will count as a single session.

The key to solving this problem is the KQL scan operator. If, like me, you’re the kind of geek that likes data analysis you’re going to love the scan operator! There’s a great blog post explaining what it does here; in summary it allows you to do process mining, ie find sequences of events that match a given pattern. I can think of a few cool things I could do with it but grouping DAX queries into sessions is fairly straightforward. Here’s my KQL query:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(4h)
| where isempty(ExecutingUser)==false 
and ExecutingUser<>"Power BI Service"
and OperationName=="QueryEnd"
| project OperationName, OperationDetailName, TimeGenerated, 
EventText, DurationMs, ExecutingUser
| partition by ExecutingUser
(
sort by TimeGenerated asc
| scan declare (SessionStartTime:datetime) with
(
    step x: true => 
        SessionStartTime = 
            iif(isempty(x.SessionStartTime),
                TimeGenerated, 
                iif((TimeGenerated-x.TimeGenerated)>90s, 
                    TimeGenerated,x.SessionStartTime));
)
)
| summarize 
    SessionEndTime=max(TimeGenerated),
    QueryCount=count() 
    by ExecutingUser, SessionStartTime
| extend SessionLength = SessionEndTime - SessionStartTime
| sort by SessionStartTime asc

This is how the query works:

  • Filters the Log Analytics data down to the queries run in the last four hours which were run by a real user and not the Power BI Service
  • Splits this data into separate tables (using the KQL partition operator) for each user
  • Sorts these tables by the TimeGenerated column
  • For each of these tables, add a new column called SessionStartTime that contains the value from the TimeGenerated column and copies it down for each subsequent query if its start time is less than 90 seconds; if a query starts more than 90 seconds after the previous one then SessionStartTime contains that query’s TimeGenerated value again
  • Uses the summarize operator to group the data by the values in SessionStartTime and ExecutingUser

Here’s the output:

This query returns one row per user session where:

  • SessionStartTime is the end time of the first DAX query run in the session
  • SessionEnd is the end time of the last DAX query run in the session
  • QueryCount is the number of DAX queries run in the session
  • SessionLength is the duration of the session

Let me know if you have any ideas for modifying or improving this…

Monitor Power BI Queries And Refreshes With DirectQuery On Log Analytics, Part 3: Queries

In the first part of this series I showed how to create a Power BI DirectQuery dataset connected to Log Analytics to do near real-time monitoring of Analysis Services activity; in the second part I showed how to use this technique to monitor dataset refreshes in a workspace. In this blog post I’ll discuss how you can use this technique to monitor queries on your datasets.

As with the post on refreshes, the first and most important question to ask is: why do you need a DirectQuery dataset to monitor query activity? In all Power BI projects Import mode should be used where possible because it gives you better performance and more functionality. If you want to do long-term analysis of query activity then you’re better off importing the data from Log Analytics rather than using DirectQuery. So when and why would you use DirectQuery to monitory query activity in Power BI? The answer has to be when you need to be able to see what queries have been run recently, in the last hour or two, so when someone calls up to complain that their queries are slow you can see what queries they have just run and try and work out why they are slow.

It’s actually very easy to build a simple KQL query to look at query activity on your datasets: you just need to look at the QueryEnd event (or operation, as its called in Log Analytics), which is fired when a query finishes running. This event gives you all the information you need: the type of query (DAX or MDX), the duration, the CPU time, the query text and so on. The main challenge is that while you also get the IDs of the report and visual that generated the query, you don’t get the names of the report or visual. I wrote about how to get a list of visual and report IDs here and here, but how can you use that information?

Here’s an example KQL query that does just that:

let Reports = 
datatable (ReportId:string, ReportName:string)
[
    "Insert Report ID here", 
    "My Test Report"
];
let Visuals=
externaldata(VisualId: string,VisualTitle: string,VisualType: string)
[
    "Insert Shared Access Signature To CSV File Here"
]
with (format="csv", ignoreFirstRecord=true);
PowerBIDatasetsWorkspace | 
where TimeGenerated > ago(2h) |
where OperationName == 'QueryEnd' |
extend a = todynamic(ApplicationContext)|
extend VisualId = tostring(a.Sources[0].VisualId), 
ReportId = tostring(a.Sources[0].ReportId), 
Operation = coalesce(tostring(a.Sources[0].Operation), "Query"),
DatasetName = ArtifactName, Query = EventText |
lookup kind=leftouter Visuals on VisualId |
lookup kind=leftouter Reports on ReportId |
project 
ExecutingUser, TimeGenerated, WorkspaceName, DatasetName, 
Operation, OperationDetailName, 
VisualTitle, VisualType,
Status, DurationMs, CpuTimeMs, Query |
order by TimeGenerated desc 

To extract the report and visual IDs from the JSON in the ApplicationContext column I’ve used the todynamic() function and then split them out into separate columns. Then, in order to have tables to do the lookups, I used two different techniques. I used the datatable operator to create a hard-coded table with the report IDs and names; in this case I’ve only got one report, but even if you were doing this in the real world at the workspace level I think it’s feasible to create hard-coded tables with report IDs and names in because there wouldn’t be that many of them. For the visuals I saved a table with the visual IDs, names and types in a CSV file and saved it to ADLSgen2 storage, then used the externaldata() (with a shared access signature in this case just to make authentication simple) operator to read this data into a table used for the lookup.

Here’s the output:

It’s then easy to use this table in a Power BI DirectQuery dataset to build a report that shows you what queries have been run recently. This report looks a bit rubbish and I’m sure you can build something that looks a lot nicer:

Monitor Power BI Queries And Refreshes With DirectQuery On Log Analytics, Part 2: Dataset Refreshes

In the first post in this series I showed how it was possible to create a DirectQuery dataset connected to Log Analytics so you could analyse Power BI query and refresh activity in near real-time. In this post I’ll take a closer look into how you can use this data to monitor refreshes.

The focus of this series is using DirectQuery on Log Analytics to get up-to-date information (remember there’s already an Import-mode report you can use for long-term analysis of this data), and this influences the design of the dataset and report. If you’re an administrator these are the refresh-related questions you will need answers to:

  1. Which datasets are refreshing right now?
  2. Which recent datasets refreshes succeeded and which ones failed?
  3. If a dataset refresh failed, why did it fail?
  4. How long did these refreshes take?

To answer these questions I built a simple dataset with two tables in. First of all, I built a KQL query that answers these four questions and used it as the source for a DirectQuery table in my dataset. Here’s the query:

let
//get all refresh events
RefreshesInLast24Hours = 
//this table holds all the Power BI log data for a workspace
PowerBIDatasetsWorkspace
//assume that no currently running refresh has run for more than 24 hours
| where TimeGenerated > ago(24h)
//only return the events fired when refreshes begin and end
| where OperationName in ('CommandBegin', 'CommandEnd')  
and EventText has_any ('<Refresh xmlns', '\"refresh\"');
//gets just the events fired when the refreshes begin
let
CommandBegins = 
RefreshesInLast24Hours
| where OperationName == 'CommandBegin' ;
//gets just the events fired when the refreshes end
let
CommandEnds = 
RefreshesInLast24Hours
| where OperationName == 'CommandEnd' ;
//do a full outer join between the previous two tables
CommandBegins
| join kind=fullouter CommandEnds on XmlaRequestId, EventText
//select just the columns relevant to refreshes
| project ArtifactName, XmlaRequestId,
StartTime = TimeGenerated, EndTime = TimeGenerated1, 
DurationMs = DurationMs1, CpuTimeMs = CpuTimeMs1, 
Outcome = coalesce(Status1, 'In Progress'), 
ErrorCode = StatusCode1, 
WorkspaceId, WorkspaceName
| summarize StartTime=min(StartTime), EndTime=max(EndTime), DurationMs=max(DurationMs), 
CpuTimeMs=max(CpuTimeMs), ErrorCode=min(ErrorCode) 
by XmlaRequestId, ArtifactName, Outcome, WorkspaceId, WorkspaceName

Here’s an example of the output, a table with one row for each refresh in the last 24 hours:

Here’s how the query works:

  • Gets all the CommandBegin and CommandEnd events – the events that are fired when a refresh begins and ends – that occurred in the last 24 hours and stores them in two tables.
    • Only some of the CommandBegin and and CommandEnd events are related to refreshes, so the query filters on the EventText column to find them.
    • I chose to limit the data to the last 24 hours to keep the amount of data displayed to a minimum and because it’s unlikely that any refresh will take more than 24 hours. If I limited the data to the last hour, for example, then the query would not return the CommandBegin events or the CommandEnd events for any refreshes that started more than an hour ago and were still in progress, so these refreshes would not be visible in the report.
  • Does a full outer join between the table with the CommandBegin events and the table with the CommandEnd events, so the query returns one row for every that either started or ended in the last 24 hours.
  • Limits the columns in the results to show:
    • XmlaRequestId – the unique identifier that links all the events for a specific refresh operation
    • ArtifactName – the dataset name
    • WorkspaceId, WorkspaceName – the ID and name of the Power BI workspace where the dataset is stored
    • StartTime and EndTime – the refresh start time and end time,
    • DurationMs – the duration of the refresh in milliseconds
    • CpuTimeMs – the CPU Time consumed by the refresh in milliseconds,
    • Outcome – whether the refresh succeeded, failed or is still in progress
    • Errorcode – the error code that tells you why the refresh failed

Next I created a second table to return more detailed information about a specific refresh. Here’s an example of the KQL:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(24h)
| where OperationName in 
('CommandBegin', 'CommandEnd', 
'ProgressReportEnd','Error') 
| where XmlaRequestId in 
('5306b4ef-d74a-4962-9fc0-f7a299f16d61')
|project OperationName, OperationDetailName, TimeGenerated, 
EventText, XmlaRequestId, DurationMs, CpuTimeMs

This query returns the CommandBegin, CommandEnd, ProgressReportEnd and Error events for a specific refresh (so all these events have the same XmlaRequestId value).

One last complication was that, in my Power BI dataset, I was not able to create a relationship between these two table (I got some kind of query folding error although I don’t know why), so in order to be able to drill from the summary table to the detail table I had to use a dynamic M parameter for the XmlaRequestId used to filter the second table (see this post for more details on how to implement dynamic M parameters when working with KQL queries).

Here’s the report I built on this dataset showing how you can drill from the summary page to the details page for a particular refresh:

Not exactly pretty, but I’m not much of a data visualisation person and I think tables are fine for this type data. You can download a sample .pbix file (I couldn’t make a .pbit file work for some reason…) for this report here and prettify it yourself if you want. All you need to do is change the Cluster and Database parameters in the Power Query Editor to match the details of your Log Analytics workspace and database (see my last post for more details).

[Thanks to Chris Jones at Microsoft for his ideas and suggestions on this topic]

[Update 10th January 2022 – I’ve modified the summary KQL query based on feedback from another Microsoft colleague, Xiaodong Zhang, who pointed out that it didn’t work for TMSL scripts or datasets that use incremental refresh]

Monitor Power BI Queries And Refreshes With DirectQuery On Log Analytics, Part 1: Creating A Dataset

As a Power BI administrator you want to see what’s happening in your tenant right now: who’s running queries, which datasets are refreshing and so on. That way if a user calls you to complain that their report is slow or their dataset hasn’t refreshed yet you can start troubleshooting immediately. Power BI’s integration with Log Analytics (currently in preview with some limitations) is a great source of information for this kind of troubleshooting: it gives you the ability to send various useful Analysis Services engine events, events that give you detailed information about queries and refreshes among other things, to Log Analytics with a latency of only a few minutes. Once you’ve done that you can write KQL queries to understand what’s going on, but writing queries is time consuming – what you want, of course, is a Power BI report.

The dataset behind that Power BI report will need to be in DirectQuery mode if you want to see the most up-to-date data; unfortunately the current Log Analytics for Power BI Datasets template app is Import mode, which is only good if you want to look at long-term trends. It’s not obvious how to connect Power BI to Log Analytics in DirectQuery mode but it is possible: some time ago John White wrote a detailed blog post showing how you could use the Azure Data Explorer connector to allow Power BI to connect to Log Analytics in DirectQuery mode. If you read the comments on that post you’ll see that this method stopped working soon afterwards, although there was a workaround using an older version of the connector. The good news is that as of the December 2021 release of Power BI Desktop the bug was fixed and the details in John’s post are correct again. There is one other issue that needs mentioning: at the time of writing it only seems to be possible to create a DirectQuery dataset on Log Analytics if you specify a native KQL query, but since that can be simply the name of the table you want to query in Log Analytics that’s not much of a blocker.

Putting this all together, to be able to build a Power BI DirectQuery dataset and report to analyse query and refresh activity, you need to:

  1. Set up a Log Analytics workspace in the Azure Portal – see here for instructions.
  2. Connect the Power BI Service to this new Log Analytics workspace – see here for instructions – so that query and refresh events are logged to it.
  3. Open up Power BI Desktop and connect to the Log Analytics workspace using the Azure Data Explorer connector. In the connection dialog:
    • Enter a url pointing to your Log Analytics workspace in the following format in the Cluster box:
      https://ade.loganalytics.io/subscriptions/enter Azure subscription id here/resourcegroups/enter Azure resource group name here/providers/microsoft.operationalinsights/workspaces/enter Log Analytics workspace name here
    • Enter your Log Analytics workspace name in the Database box
    • In the “Table name or Azure Data Explorer query (optional)” box you need to enter a custom KQL query, which can simply be the following table name:
      PowerBIDatasetsWorkspace
    • Scroll down to the end of the dialog (which may be off the screen) and select DirectQuery for the Data Connectivity mode.

At this point you’ll have a dataset with a single table in DirectQuery mode that you can build your reports from; you might also want to use automatic page refresh to make sure the report page always shows the latest data. For example, here’s a very simple report showing query durations:

How do you interpret the data in the table in the dataset? Are there any interesting analyses you can do with more complex KQL queries or different ways of modelling the data? All good questions and ones that are out of scope for now. I’ll try to answer these questions in my next few posts though.

Matching DAX Queries To Individual Visuals In A Published Power BI Report

The integration between Azure Log Analytics and Power BI opens up a lot of new possibilities for Power BI administrators. It’s still in preview (there’s a lot more cool stuff still to come) but the Analysis Services events that are available at the time of writing are still very useful: they give you pretty much everything you had in Profiler plus some new stuff too. In this post I’ll show you how you can use this data to work out which visual in a published report generated a particular DAX query.

In Log Analytics you can get information about all the queries run against a dataset using a simple KQL query like this one (I haven’t included an explicit time filter in this query, although you would want to do this in real life):

PowerBIDatasetsWorkspace | 
where ArtifactName == "InsertWorkspaceNameHere" 
and OperationName == 'QueryEnd' 

The QueryEnd event is raised every time a query finishes running – it’s an event you may already be familiar with from Profiler, xEvents or the Log Analytics integration with AAS – and it tells you useful things like the time the query finished, the duration, the text of the DAX query, the user running the query and so on. However what is new for Power BI is the contents ApplicationContext column. My colleague Kasper blogged about how you can use this column to get the DatasetId and ReportId when using AAS here; it now also contains a GUID that identifies the exact visual on a report that generated the DAX query.

Taking a closer look at this column shows that it contains a JSON document:

Here’s a more sophisticated KQL query that parses that JSON document, extracts the contents into separate columns and returns other columns you might need for troubleshooting like the DAX query text and query duration:

PowerBIDatasetsWorkspace | 
where ArtifactName == "InsertWorkspaceNameHere" 
and OperationName == 'QueryEnd' |
extend a = todynamic(ApplicationContext)|
extend VisualId = a.Sources[0].VisualId, ReportId = a.Sources[0].ReportId, 
DatasetName = ArtifactName, DAXQuery = EventText |
project TimeGenerated, WorkspaceName, DatasetName, ReportId, VisualId, DurationMs, DAXQuery |
order by TimeGenerated desc 

The ID of the dataset isn’t that interesting because you can get it, and the name of the dataset, in other columns. The ReportId is useful because it tells you which report generated the query and there are a number of ways you can find out which report this relates to. For example, you can just open the report in the browser and look at the url: the Report ID is the GUID between “/reports/” and “/ReportSection” as detailed here.

How can you work out which visual the VisualId relates to though? It’s not obvious, but it is possible. First you have to open your report in Power BI Desktop, open Performance Analyzer, refresh the report page and export the data from Performance Analyzer as a JSON file as detailed here. The data you need is hidden in this JSON file in the Visual Container Lifecycle event; here’s the M code for a Power Query query to get it:

let
  Source = Json.Document(
    File.Contents(
      "C:\InsertFileNameHere.json"
    )
  ), 
  events = Source[events], 
  #"Converted to Table" = Table.FromList(
    events, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ), 
  #"Expanded Column1"
    = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {
      "name", 
      "component", 
      "start", 
      "id", 
      "metrics", 
      "end", 
      "parentId"
    }, 
    {
      "name", 
      "component", 
      "start", 
      "id", 
      "metrics", 
      "end", 
      "parentId"
    }
  ), 
  #"Filtered Rows" = Table.SelectRows(
    #"Expanded Column1", 
    each (
      [name]
        = "Visual Container Lifecycle"
    )
  ), 
  #"Expanded metrics"
    = Table.ExpandRecordColumn(
    #"Filtered Rows", 
    "metrics", 
    {
      "status", 
      "visualTitle", 
      "visualId", 
      "visualType", 
      "initialLoad"
    }, 
    {
      "status", 
      "visualTitle", 
      "visualId", 
      "visualType", 
      "initialLoad"
    }
  ), 
  #"Removed Other Columns"
    = Table.SelectColumns(
    #"Expanded metrics", 
    {
      "visualTitle", 
      "visualId", 
      "visualType"
    }
  )
in
  #"Removed Other Columns"

Here’s an example of the output of this Power Query query:

This shows that, on the report in question, there were three visuals: a card visual called “My Card”, a column chart called “My Column Chart” and a table called “My Table”. It also shows the IDs of these visuals which you can match up with the VisualId values in the ApplicationContext column in Log Analytics.

The last problem is matching these names up with the actual visuals in the report and this will only be easy if the report designer has followed the best practice of giving each visual a meaningful name; you can now do this by double-clicking on the visual in the Selection pane or by editing the Title Text property as Reid Havens shows here.

[Edit October 2023 – there’s now an even easier way to get these IDs using Power BI Desktop Developer Mode https://blog.crossjoin.co.uk/2023/10/22/getting-report-visual-ids-with-power-bi-desktop-developer-mode/ ]