Monitor Power BI Queries And Refreshes With DirectQuery On Log Analytics, Part 2: Dataset Refreshes

In the first post in this series I showed how it was possible to create a DirectQuery dataset connected to Log Analytics so you could analyse Power BI query and refresh activity in near real-time. In this post I’ll take a closer look into how you can use this data to monitor refreshes.

The focus of this series is using DirectQuery on Log Analytics to get up-to-date information (remember there’s already an Import-mode report you can use for long-term analysis of this data), and this influences the design of the dataset and report. If you’re an administrator these are the refresh-related questions you will need answers to:

  1. Which datasets are refreshing right now?
  2. Which recent dataset refreshes succeeded and which ones failed?
  3. If a dataset refresh failed, why did it fail?
  4. How long did these refreshes take?

To answer these questions I built a simple dataset with two tables in it. First of all, I wrote a KQL query that answers these four questions and used it as the source for a DirectQuery table in my dataset. Here’s the query:

let
//get all refresh events
RefreshesInLast24Hours = 
//this table holds all the Power BI log data for a workspace
PowerBIDatasetsWorkspace
//assume that no currently running refresh has run for more than 24 hours
| where TimeGenerated > ago(24h)
//only return the events fired when refreshes begin and end
| where OperationName in ('CommandBegin', 'CommandEnd')  
and EventText has_any ('<Refresh xmlns', '\"refresh\"');
//gets just the events fired when the refreshes begin
let
CommandBegins = 
RefreshesInLast24Hours
| where OperationName == 'CommandBegin' ;
//gets just the events fired when the refreshes end
let
CommandEnds = 
RefreshesInLast24Hours
| where OperationName == 'CommandEnd' ;
//do a full outer join between the previous two tables
CommandBegins
| join kind=fullouter CommandEnds on XmlaRequestId, EventText
//select just the columns relevant to refreshes
| project ArtifactName, XmlaRequestId,
StartTime = TimeGenerated, EndTime = TimeGenerated1, 
DurationMs = DurationMs1, CpuTimeMs = CpuTimeMs1, 
Outcome = coalesce(Status1, 'In Progress'), 
ErrorCode = StatusCode1, 
WorkspaceId, WorkspaceName
| summarize StartTime=min(StartTime), EndTime=max(EndTime), DurationMs=max(DurationMs), 
CpuTimeMs=max(CpuTimeMs), ErrorCode=min(ErrorCode) 
by XmlaRequestId, ArtifactName, Outcome, WorkspaceId, WorkspaceName

Here’s an example of the output, a table with one row for each refresh in the last 24 hours:

Here’s how the query works:

  • Gets all the CommandBegin and CommandEnd events – the events that are fired when a refresh begins and ends – that occurred in the last 24 hours and stores them in two tables.
    • Only some of the CommandBegin and CommandEnd events are related to refreshes, so the query filters on the EventText column to find them.
    • I chose to limit the data to the last 24 hours to keep the amount of data displayed to a minimum and because it’s unlikely that any refresh will take more than 24 hours. If I limited the data to the last hour, for example, then the query would not return the CommandBegin events or the CommandEnd events for any refreshes that started more than an hour ago and were still in progress, so these refreshes would not be visible in the report.
  • Does a full outer join between the table with the CommandBegin events and the table with the CommandEnd events, so the query returns one row for every refresh that either started or ended in the last 24 hours.
  • Limits the columns in the results to show:
    • XmlaRequestId – the unique identifier that links all the events for a specific refresh operation
    • ArtifactName – the dataset name
    • WorkspaceId, WorkspaceName – the ID and name of the Power BI workspace where the dataset is stored
    • StartTime and EndTime – the refresh start time and end time
    • DurationMs – the duration of the refresh in milliseconds
    • CpuTimeMs – the CPU time consumed by the refresh in milliseconds
    • Outcome – whether the refresh succeeded, failed or is still in progress
    • ErrorCode – the error code that tells you why the refresh failed

Next I created a second table to return more detailed information about a specific refresh. Here’s an example of the KQL:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(24h)
| where OperationName in 
('CommandBegin', 'CommandEnd', 
'ProgressReportEnd','Error') 
| where XmlaRequestId in 
('5306b4ef-d74a-4962-9fc0-f7a299f16d61')
| project OperationName, OperationDetailName, TimeGenerated, 
EventText, XmlaRequestId, DurationMs, CpuTimeMs

This query returns the CommandBegin, CommandEnd, ProgressReportEnd and Error events for a specific refresh (so all these events have the same XmlaRequestId value).

One last complication was that, in my Power BI dataset, I was not able to create a relationship between these two tables (I got some kind of query folding error, although I don’t know why), so in order to be able to drill from the summary table to the detail table I had to use a dynamic M parameter for the XmlaRequestId used to filter the second table (see this post for more details on how to implement dynamic M parameters when working with KQL queries).
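
To give an idea of how this works, here’s a rough sketch of the M behind the detail table, assuming a Power Query text parameter called XmlaRequestIdParam (bound to a dynamic M parameter in the model) plus the same Cluster and Database parameters used elsewhere in the dataset – it simply splices the selected XmlaRequestId into the KQL query text:

let
  //build the KQL query text, splicing in the selected XmlaRequestId value
  KQLQuery = "PowerBIDatasetsWorkspace "
    & "| where TimeGenerated > ago(24h) "
    & "| where OperationName in ('CommandBegin', 'CommandEnd', 'ProgressReportEnd', 'Error') "
    & "| where XmlaRequestId == '" & XmlaRequestIdParam & "' "
    & "| project OperationName, OperationDetailName, TimeGenerated, EventText, XmlaRequestId, DurationMs, CpuTimeMs",
  //run the query against the Log Analytics workspace using the Azure Data Explorer connector
  Source = AzureDataExplorer.Contents(Cluster, Database, KQLQuery)
in
  Source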

Here’s the report I built on this dataset showing how you can drill from the summary page to the details page for a particular refresh:

Not exactly pretty, but I’m not much of a data visualisation person and I think tables are fine for this type of data. You can download a sample .pbix file (I couldn’t make a .pbit file work for some reason…) for this report here and prettify it yourself if you want. All you need to do is change the Cluster and Database parameters in the Power Query Editor to match the details of your Log Analytics workspace and database (see my last post for more details).

[Thanks to Chris Jones at Microsoft for his ideas and suggestions on this topic]

[Update 10th January 2022 – I’ve modified the summary KQL query based on feedback from another Microsoft colleague, Xiaodong Zhang, who pointed out that it didn’t work for TMSL scripts or datasets that use incremental refresh]

Monitor Power BI Queries And Refreshes With DirectQuery On Log Analytics, Part 1: Creating A Dataset

As a Power BI administrator you want to see what’s happening in your tenant right now: who’s running queries, which datasets are refreshing and so on. That way if a user calls you to complain that their report is slow or their dataset hasn’t refreshed yet you can start troubleshooting immediately. Power BI’s integration with Log Analytics (currently in preview with some limitations) is a great source of information for this kind of troubleshooting: it gives you the ability to send various useful Analysis Services engine events, events that give you detailed information about queries and refreshes among other things, to Log Analytics with a latency of only a few minutes. Once you’ve done that you can write KQL queries to understand what’s going on, but writing queries is time consuming – what you want, of course, is a Power BI report.

The dataset behind that Power BI report will need to be in DirectQuery mode if you want to see the most up-to-date data; unfortunately the current Log Analytics for Power BI Datasets template app is Import mode, which is only good if you want to look at long-term trends. It’s not obvious how to connect Power BI to Log Analytics in DirectQuery mode but it is possible: some time ago John White wrote a detailed blog post showing how you could use the Azure Data Explorer connector to allow Power BI to connect to Log Analytics in DirectQuery mode. If you read the comments on that post you’ll see that this method stopped working soon afterwards, although there was a workaround using an older version of the connector. The good news is that as of the December 2021 release of Power BI Desktop the bug was fixed and the details in John’s post are correct again. There is one other issue that needs mentioning: at the time of writing it only seems to be possible to create a DirectQuery dataset on Log Analytics if you specify a native KQL query, but since that can be simply the name of the table you want to query in Log Analytics that’s not much of a blocker.

Putting this all together, to be able to build a Power BI DirectQuery dataset and report to analyse query and refresh activity, you need to:

  1. Set up a Log Analytics workspace in the Azure Portal – see here for instructions.
  2. Connect the Power BI Service to this new Log Analytics workspace – see here for instructions – so that query and refresh events are logged to it.
  3. Open up Power BI Desktop and connect to the Log Analytics workspace using the Azure Data Explorer connector. In the connection dialog:
    • Enter a URL pointing to your Log Analytics workspace in the following format in the Cluster box:
      https://ade.loganalytics.io/subscriptions/enter Azure subscription id here/resourcegroups/enter Azure resource group name here/providers/microsoft.operationalinsights/workspaces/enter Log Analytics workspace name here
    • Enter your Log Analytics workspace name in the Database box
    • In the “Table name or Azure Data Explorer query (optional)” box you need to enter a custom KQL query, which can simply be the following table name:
      PowerBIDatasetsWorkspace
    • Scroll down to the end of the dialog (which may be off the screen) and select DirectQuery for the Data Connectivity mode.
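
For reference, the M query generated behind the resulting table should look something like the following sketch (the subscription, resource group and workspace names are placeholders, and the exact code generated by the connector may differ slightly):

let
  Source = AzureDataExplorer.Contents(
    //the Cluster URL, built from your subscription, resource group and workspace names
    "https://ade.loganalytics.io/subscriptions/<subscription id>/resourcegroups/<resource group>/providers/microsoft.operationalinsights/workspaces/<workspace name>",
    //the Database, which is the Log Analytics workspace name
    "<workspace name>",
    //the KQL query, which in this case is just a table name
    "PowerBIDatasetsWorkspace"
  )
in
  Source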

At this point you’ll have a dataset with a single table in DirectQuery mode that you can build your reports from; you might also want to use automatic page refresh to make sure the report page always shows the latest data. For example, here’s a very simple report showing query durations:

How do you interpret the data in the table in the dataset? Are there any interesting analyses you can do with more complex KQL queries or different ways of modelling the data? All good questions and ones that are out of scope for now. I’ll try to answer these questions in my next few posts though.

Seventeenth Blog Birthday

Every year, on the anniversary of the first-ever post on this blog, I write a post reflecting on what has happened to me professionally in the past year. In the past this has meant I’ve written about learning some new technology or language (yes, DAX and M were new once), dealing with a business issue back when I had my own company, or more recently adjusting to life as a full-time employee at Microsoft. This year’s new challenge has been becoming a manager for the first time.

One of the great things about working for Microsoft is that, unlike many other companies, you don’t need to become a manager to progress in your career. However, the Power BI CAT team is growing and when I was given the opportunity to manage a part of it I thought, why not? I’m in my late 40s and have never had anyone reporting to me before – in my defence I spent many years working for myself – so I thought it would be good to get out of my comfort zone and try something new. It helped that I have a very supportive manager and that several other people on the team made the transition to management at the same time, which means we’re all learning together. I have particularly enjoyed recruiting new people for my team from inside and outside Microsoft: I have some extremely talented people on my team already, with several more due to start in early 2022.

This doesn’t mean I have left technical things behind though, just that I’m spending less time doing technical things and more time managing other people who do technical things. I think it’s important to stay as technical as I can and to maintain some direct contact with customers in order for me to be an effective manager; again, one of the things I like about Microsoft and my team in particular is that I haven’t had to make a binary choice between being a manager or being technical. That said I have had to accept that, more than before, there are problems I can’t help solve and shiny new things I don’t have time to learn about, and that has been hard.

None of this will affect this blog’s focus on Power BI (I’m certain no-one is interested in my thoughts about management…) or how often I blog but it will accelerate a trend that I suspect has been apparent for the last year or so. The primary motivation for me to blog has always been my own education: writing down information I can’t find anywhere else means it doesn’t get lost and explaining it to other people helps me understand it myself. Now, though, this is pretty much the only reason for me to blog, which means even more of the obscure factoids about Power Query data privacy settings and even less of the click-friendly top ten lists about Power BI/Excel integration type of content. I don’t think many people came here for the introductory tutorials though, did they?

Power BI DirectQuery Query Labels For SQL Server-Related Sources

If you’re using Power BI in DirectQuery mode against a SQL Server-related data source (ie SQL Server on-prem, Azure SQL DB or Synapse) you may have noticed a new feature that was released a few weeks ago: query labels that allow you to link a SQL query to the Power BI visual that generated it.

There’s nothing you need to do to enable it – it happens automatically. Here’s an example of a TSQL query generated by a dataset in DirectQuery mode that includes one of these labels:

You can see that an OPTION clause in the query adds a label containing three GUIDs that identify the dataset, report and visual that the query was generated for. These are the same GUIDs that you’ll see used for this purpose in other places, for example Log Analytics. This post has more details on how you can work out which datasets, reports and visuals these GUIDs relate to.

As I said, this only works for SQL Server-related sources at the moment, but if you think this would be useful for other sources (and I’m sure you do…) please leave a comment below!

Simulating Slow Data Sources In Power BI

As a postscript to my series on Power BI refresh timeouts (see part 1, part 2 and part 3) I thought it would be useful to document how I was able to simulate a slow data source in Power BI without using large data volumes or deliberately complex M code.

It’s relatively easy to create an M query that returns a table of data after a given delay. For example, this query returns a table with one column and one row after one hour and ten minutes:

let
  Source = Function.InvokeAfter(
    () => #table({"A"}, {{1}}), 
    #duration(0, 1, 10, 0)
  )
in
  Source

Some notes:

  • I’m using #table to return the table without having to connect to a data source. More details on how to use #table can be found here.
  • The delay is achieved using the Function.InvokeAfter M function, with the amount of time to wait specified using #duration.

A more interesting problem is how to create an M query that, instead of waiting for a given duration and then returning a table immediately, returns the rows of a table one at a time with a delay between each row. Here’s a query that does that, returning ten rows one second at a time:

let
  NumberOfRows = 10,
  DaysToWait = 0,
  HoursToWait = 0,
  MinutesToWait = 0,
  SecondsToWait = 1,
  Source = #table(
    {"A"},
    List.Transform(
      {1 .. NumberOfRows},
      each Function.InvokeAfter(
        () => {1},
        #duration(
          DaysToWait,
          HoursToWait,
          MinutesToWait,
          SecondsToWait
        )
      )
    )
  )
in
  Source

Last of all, to simulate a slow SQL Server data source – not being much good at TSQL at all – I borrowed some code from this thread on Stack Overflow to create a function that returns a scalar value after a specified number of seconds:

CREATE FUNCTION [dbo].[ForceDelay](@seconds int) returns int as 
BEGIN DECLARE @endTime datetime2(0) = DATEADD(SECOND, @seconds, GETDATE()); 
WHILE (GETDATE() < @endTime ) 
SET @endTime = @endTime;
return 1;
END

I showed how to call this function from Power Query using a native SQL query here.
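
For convenience, here’s a rough sketch of what that call looks like from M, assuming the function has been created in the AdventureWorksDW2017 database on a local SQL Server instance:

let
  Source = Sql.Database(
    //placeholder server and database names
    "localhost",
    "AdventureWorksDW2017",
    [
      //native SQL query that calls the function, forcing a 30 second delay
      Query = "select [dbo].[ForceDelay](30) as MyColumn"
    ]
  )
in
  Source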

Troubleshooting Power BI Timeouts, Part 3: The External Command Timeout

In the first post in this series I showed how any Power BI dataset refresh started via the Power BI portal or API is limited to 2 hours in Shared capacity and 5 hours in Premium capacity, and how you could work around that by running a refresh via Premium’s XMLA endpoint feature. In the second post in this series I showed how some M functions allow you to set timeouts. However, even if you initiate a refresh via the XMLA endpoint you may still get a timeout error and in this post I’ll discuss another reason why: the External Command Timeout.

This property is a hangover from Analysis Services (you can see it documented here). It represents the amount of time the Analysis Services engine inside Power BI will wait to get data from a data source. How it behaves exactly depends on the data source: it may limit the amount of time it takes to get the first row of data from the source, or it may limit the amount of time it takes to get all the rows of data. In Power BI Premium it is set to five hours, which means that no single partition can take more than about five hours to refresh. In the first post in this series I worked around this by creating a dataset with multiple partitions, each of which took about an hour to refresh, but when I tried to refresh a dataset with a single partition that took more than five hours from SQL Server Management Studio through the XMLA Endpoint, I got the following error:

The error message here is:

Timeout expired. The timeout period elapsed prior to completion of the operation.. The exception was raised by the IDbCommand interface.

In this case I saw the same error in the Error event in Profiler:

…and in other cases, when testing a different source, I got a different error in Profiler in the Progress Report Error event:

The message here is:

Error processing partition ‘<pii>SlowCommand-2114bb81-69d3-4fe4-9d54-6b2661b0c497</pii>’ of table ‘<pii>SlowCommand</pii>’ [Internal Error Code: ‘0xc112001a’].

There’s no way to avoid the External Command Timeout. Instead, what you need to do is either change your partitioning strategy so each partition refreshes in under five hours or tune your data source, M code or gateway (if you’re using one) so that data is returned to Power BI faster.

In Shared capacity I believe the External Command Timeout is set to two hours (again, to match the overall refresh timeout) but it’s much less important there because you can’t create partitions manually (the only way a dataset in Shared can be partitioned is by setting up incremental refresh) and there’s no XMLA Endpoint so there’s no way to work around the two hour overall refresh limit anyway.

[Thanks, as always, to Akshai Mirchandani for a lot of the information in this post]

Troubleshooting Power BI Timeouts, Part 2: Timeouts Specified In Power Query Functions

In the first post in this series I showed how the Power BI Service applies a limit on the total amount of time it takes to refresh a dataset, except when you initiate your refresh via an XMLA Endpoint. In this post I’ll look at the various timeouts that can be configured in Power Query functions that are used to access data.

Every time a Power BI Import mode dataset connects to a data source it goes through a Power Query query, and inside the code of that Power Query query will be an M function that connects to a specific type of data source. Most – but not all – of these M functions have the option to set timeouts. The M functions that connect to relational databases (for example Sql.Database which is used to connect to SQL Server-related sources, or Odbc.DataSource which is used to connect to ODBC sources) typically allow you to configure two types of timeout:

  • A connection timeout, which specifies how long to wait when trying to open a connection to the data source
  • A command timeout, which specifies how long the query to get data from the source is allowed to run
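
For example, here’s a minimal sketch showing both types of timeout being set in the options record of Sql.Database (the server and database names are placeholders):

let
  Source = Sql.Database(
    //placeholder server and database names
    "myserver.database.windows.net",
    "MyDatabase",
    [
      //give up if a connection cannot be opened within 15 seconds
      ConnectionTimeout = #duration(0, 0, 0, 15),
      //allow the query that gets the data to run for up to 10 minutes
      CommandTimeout = #duration(0, 0, 10, 0)
    ]
  )
in
  Source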

Some other functions have other timeouts more appropriate to the data source they access: for example Web.Contents and OData.Feed have a Timeout property for the HTTP requests they make behind the scenes. Other functions (for example Excel.Workbook) have no timeout-related properties that you can set at all. You should check the documentation, either online or using the #shared intrinsic variable in the Power Query Editor, to see what options are available and what the default timeouts are.
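
As an illustration, here’s a hypothetical sketch of setting the Timeout option on Web.Contents (the URL is made up):

let
  Source = Web.Contents(
    //placeholder URL
    "https://api.example.com/data",
    [
      //wait no more than two minutes for the HTTP request to complete
      Timeout = #duration(0, 0, 2, 0)
    ]
  )
in
  Source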

Here’s a simple example of how to set a timeout when connecting to SQL Server. First of all, I created a scalar function called ForceDelay in TSQL that returns a value of 1 after a specified number of seconds, so the query:

select [dbo].[ForceDelay](10) as MyColumn

…takes 10 seconds to run.

When you connect to SQL Server in Power BI/Power Query you can paste a SQL query in the connection dialog under Advanced Options; when you do so you’ll also see the option to set the command timeout in minutes:

Here’s the M code generated when you use the SQL query above, set the Command timeout to 1 minute and click OK:

let
  Source = Sql.Database(
    "localhost",
    "AdventureWorksDW2017",
    [
      Query
        = "select [dbo].[ForceDelay](10) as MyColumn",
      CommandTimeout = #duration(
        0,
        0,
        1,
        0
      )
    ]
  )
in
  Source

Notice that the CommandTimeout option has been set on the Sql.Database function, and that the value passed to it is a duration of one minute defined using #duration. Since the SQL query takes 10 seconds to run and the timeout is 1 minute, it runs successfully.

However, if you set the CommandTimeout option to 5 seconds like so:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017", 
    [
      Query
        = "select [dbo].[ForceDelay](10) as MyColumn", 
      CommandTimeout = #duration(
        0, 
        0, 
        0, 
        5
      )
    ]
  )
in
  Source

…then the query will take longer than the timeout and you’ll see the following error in the Power Query Editor:

The error message is:

DataSource.Error: Microsoft SQL: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.

If you hit the timeout when refreshing in the main Power BI window you’ll see the same message in the refresh dialog box:

So, as you can see, if you are working with large data volumes or slow queries you will need to be aware of the default timeouts set in the M functions you’re using and alter them if need be.

Troubleshooting Power BI Timeouts, Part 1: The Two/Five Hour Import Refresh Limit

If you’re working with a large Power BI dataset and/or a slow data source in Import mode it can be very frustrating to run into timeout errors after you have already waited a long time for a refresh to finish. There are a number of different types of timeout that you might run into, and in this series I’ll look at a few of them and discuss some of the ways you can work around them.

In this post I’ll look at one of the most commonly-encountered timeouts: the limit on the maximum length of time an Import mode dataset refresh can take. As documented here these limits are:

  • Two hours for an import-mode dataset in Shared capacity
  • Five hours for an import-mode dataset in a Premium or PPU capacity

If you are using Premium you can connect to your dataset using SQL Server Profiler, run a trace when the refresh starts, and see the timeout set as a property of the TMSL command that runs the refresh (18000 seconds = 5 hours):

Here’s an example of the message you will see in the Refresh History dialog if your dataset refresh takes more than five hours in PPU:

Once again you get more details in a Profiler trace. When the refresh times out you’ll see an Error event with a long message that starts with the following text:

Timeout expired. The timeout period elapsed prior to completion of the operation.. The exception was raised by the IDbCommand interface. The command has been canceled..

There is a way to work around this limit if you’re using Premium or PPU: it only applies to refreshes (scheduled or manual) that you start from the Power BI portal or via the REST API. If you start your refresh by running a TMSL script via the XMLA Endpoint (for example via SQL Server Management Studio or from PowerShell) then the limit does not apply. This is because you’re executing your own TMSL script rather than having the Power BI Service execute its own refresh command – with a timeout set – like the one shown in the screenshot above.
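
For example, here’s a minimal sketch of a TMSL script that performs a full refresh of an entire dataset and that could be run from an XMLA query window in SQL Server Management Studio ("MyDataset" is a placeholder for your dataset name):

{
  "refresh": {
    "type": "full",
    "objects": [
      {
        "database": "MyDataset"
      }
    ]
  }
}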

For example, here’s the same dataset from the screenshot above but refreshed successfully from SQL Server Management Studio (notice the Type column says “Via XMLA Endpoint”) and with a duration of just over five hours:

There are a couple of blog posts out there showing how you can implement a refresh strategy using the XMLA Endpoint; this post from Marc Lelijveld and Paulien van Eijk is a great example of how to do this using Azure Automation and Azure Data Factory.

You should also consider tuning any refresh that takes a long time: it may be that after tuning you fall well within the two/five hour limit. There are a lot of things to consider when tuning dataset refresh; I did a conference session on this topic last year (you can watch the recording here) covering issues like data modelling, query folding in Power Query and the use of dataflows, but it doesn’t cover everything and I have learned a lot even since then. In my opinion one of the most important things you can do to improve refresh performance for very large Import datasets is to partition your fact tables and increase the amount of parallelism, as I describe here. One customer I worked with was able to reduce their refresh time from four hours down to one hour using this technique:

As I said, though, there are many different types of timeout that you may encounter – so even if you refresh via the XMLA Endpoint it may still time out for another reason.

Setting SQL Server CONTEXT_INFO In Power Query

In my quest to check out every last bit of obscure Power Query functionality, this week I looked into the ContextInfo option on the Sql.Database and Sql.Databases M functions. This option allows you to set CONTEXT_INFO in SQL Server (see here for a good article explaining what this is) and here’s an example of how to use it:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017", 
    [
      Query = "SELECT * FROM DIMDATE", 
      ContextInfo = Text.ToBinary(
        "Hello"
      )
    ]
  )
in
  Source

This Power Query query runs a simple SQL SELECT statement against the SQL Server Adventure Works DW 2017 database. Note that since you need to pass a binary value to the ContextInfo option, in this example I had to use the Text.ToBinary function to convert my text to binary.

Here’s what happens in SQL Server when this Power Query query is run:

Here’s a simple example of how to retrieve this data on the SQL Server side:

SELECT session_id, login_time, program_name, context_info 
FROM sys.dm_exec_sessions
WHERE session_id=57

I’ll leave it to the SQL Server experts to decide what this can be used for and no doubt to complain that it would be more useful to support SESSION_CONTEXT too – although I’ve heard that might already be used for something, so I need to do more research here…

New Book: “Extending Power BI With Python And R”

Back in September I posted about a few new Power BI-related books I was given to review for free; that post led to me getting sent another freebie book, “Extending Power BI with Python and R”, by Luca Zavarella (buy it from Amazon UK here). I found this book particularly interesting because I know very little about Python or R, but I also know that this is a really hot topic for many people and I was curious to know what problems using these languages in Power BI might solve.

I can’t comment on the quality of the Python and R advice (although I’m pretty sure Luca knows what he’s writing about), but from the point of view of a Power BI developer the book does a good job of explaining how using them allows you to do things that are difficult or impossible otherwise. There are chapters on regular expressions, calling APIs, using machine learning models and advanced visualisations. I haven’t seen any other books, videos or blog posts that cover these topics in such detail, so if you have some Python or R skills and want to make use of them in Power BI this book seems to be a good bet.
