Extracting Power BI Import Mode Job Graph Data To A Table

In my last post but one I showed how you could create a DGML file that contains detailed information on what happens during a Power BI Import mode refresh – the “job graph” – and visualise that in Visual Studio or VS Code, to help with performance tuning. In my last post, I explained the concepts of blocking and waiting which are key to understanding this data. In this post I’ll share and describe a Dataflow template that extracts the data contained in the DGML file to a table for other types of analysis – which turns out to be quite simple, because DGML is an XML-based format which is easy to work with in Power Query and Dataflows.
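If you prefer notebooks to Dataflows, the same extraction only needs a few lines of Python, because the DGML file is just XML. Here's a rough sketch: the file path is a placeholder, and I'm simply turning every attribute on the Node and Link elements into columns, so the exact column names you get will depend on what's in your DGML file:

import xml.etree.ElementTree as ET
import pandas as pd

# DGML files use this XML namespace
ns = {"dgml": "http://schemas.microsoft.com/vs/2009/dgml"}

# placeholder path: the notebook in the earlier post names the file after the refresh ID
tree = ET.parse("/lakehouse/default/Files/your-refresh-id.dgml")
root = tree.getroot()

# one row per refresh job: every attribute on a <Node> element becomes a column
jobs = pd.DataFrame([node.attrib for node in root.findall("dgml:Nodes/dgml:Node", ns)])

# one row per dependency between jobs, taken from the <Link> elements
links = pd.DataFrame([link.attrib for link in root.findall("dgml:Links/dgml:Link", ns)])

display(jobs)
display(links)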

You can download the .pqt template file here (for more information on how to use templates see here). To use it, create a new Dataflow Gen2 and import the template file. You’ll see the following queries:

Next, change the values of the three parameters to contain the IDs of the refresh whose DGML file you want to load (the notebook in this post that creates the DGML file uses the refresh ID as the file’s name) and the workspace and lakehouse where that file is stored. Finally, set output destinations on the RefreshJobs query, the RefreshJobsPivoted query, or both. The RefreshJobs query returns a table with one row per refresh job:

This gives you all the data in the DGML file that was visible using a DGML viewer but in table form. For each job you get the CreatedAt, RunnableAt, StartedAt and FinishedAt datetimes, the Blocked, Wait and Running durations, the Slot ID, the CPU used and whether the job is on the critical path.

The RefreshJobsPivoted query gives you exactly the same data but, instead of one row per job, you get three: one row each for the blocked, waiting and running phases of the job, with the start time, end time and duration of each phase. This may be a more convenient format for visualisation and analysis:

The Links query gives you the dependencies between the jobs:

Having loaded all this data into a Lakehouse you can now build a report on it. As I said before, I’m sure there’s a better way of visualising this data that shows all the durations and dependencies, but even a basic table report on the RefreshJobs table like this is really useful because it shows you which jobs were blocked, which ones had to wait, and which ones took a long time to run:

In this example (the Dependencies example from my previous post, but where the refresh has maxParallelism set to 1, so there is blocking as well as waiting) you can see that refreshing table X took 60 seconds, that the refresh for table Y had to wait for 60 seconds before it could start and took 10 seconds, and that refresh for table XY was blocked for 70 seconds before it could start. So, if you wanted to make this refresh run faster, you would want to understand why table X was so slow and also look at increasing the amount of parallelism so Y did not have to wait.

Understanding Blocking And Waiting In Power BI Import Mode Refreshes

Following on from my previous post showing how you can visualise the job graph for a Power BI Import mode semantic model refresh, in this post I will look at how you can interpret what the job graph tells you – specifically, explaining the concepts of blocking and waiting. As always, simple examples are the best way of doing this.

Blocking

Blocking occurs when one job can only start when one or more other jobs have completed because there is a dependency between them. Consider the following semantic model:

X and Y are tables that contain a single numeric column. X takes 1 minute to load while Y takes 10 seconds to load (I forced this delay using a technique I blogged about here). XYUnion is a DAX calculated table that unions all the rows from X and Y with the following definition:

XYUnion = UNION(X,Y)

As you can imagine, the job that refreshes XYUnion can only start once the jobs that refresh both X and Y have finished; XYUnion will be blocked until both X and Y have refreshed. Here’s what the job graph for the refresh of this semantic model looks like:

At the centre is the job “Process Partition XYUnion[XYUnion]” which refreshes the calculated table XYUnion. The arrows going from this job to the jobs “Process Partition X[X]” and “Process Partition Y[Y]”, which refresh tables X and Y, show that this job depends on those two other jobs.

Hovering over the “Process Partition XYUnion[XYUnion]” job shows the following popup:

There are four datetime values here: CreatedAt, RunnableAt, StartedAt and FinishedAt. There are also three durations:

  • Block is the elapsed time between CreatedAt and RunnableAt, and is the amount of time that passed before all the jobs this job depends on had completed. Anything other than zero here means that blocking occurred.
  • Wait is the elapsed time between RunnableAt and StartedAt. A job becomes runnable when all the jobs it depends on have completed, but even then it may not be able to start because Power BI only allows a certain number of jobs to refresh in parallel (see this post for more details and how to control the amount of parallelism). Waiting is described in the example below.
  • Run is the elapsed time between StartedAt and FinishedAt and is the amount of time the job itself took to run (see the short sketch after this list for how the three durations relate to the four datetimes).
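Here's that sketch: a trivial bit of Python with made-up timestamps, just to make the relationship between the four datetimes and the three durations concrete.

from datetime import datetime

# made-up timestamps for a job like XYUnion in the example below
created_at  = datetime(2024, 1, 1, 10, 0, 0)
runnable_at = datetime(2024, 1, 1, 10, 1, 0)   # all dependencies finished here
started_at  = datetime(2024, 1, 1, 10, 1, 0)   # a slot was free immediately
finished_at = datetime(2024, 1, 1, 10, 1, 5)

block = runnable_at - created_at    # time spent blocked by dependencies
wait  = started_at - runnable_at    # time spent waiting for a free slot
run   = finished_at - started_at    # time the job itself took to run

print(block, wait, run)  # 0:01:00 0:00:00 0:00:05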

In this case you can see that the value for Block for XYUnion is 1 minute: X and Y have no preceding jobs so they kick off at the same time, X takes 1 minute to run and Y takes 10 seconds, so it is 1 minute before XYUnion can run. The popups for X and Y show 1 minute and 10 seconds respectively for Run, as you would expect:

One last thing to mention: in the full job graph diagram above you’ll see that certain nodes are highlighted in red. That’s because they are on the critical path, which is documented here; it’s the chain of jobs that dictates the overall length of the refresh, so if you want to make your refresh faster then you need to tune the jobs on the critical path. In this case the critical path goes through X to XYUnion: if you wanted the whole semantic model to refresh faster you would need to tune the refresh for either X or XYUnion; tuning the refresh for Y would make no difference to the overall semantic model refresh time.

Waiting

As I already mentioned, there is a limit on the number of jobs within a refresh that can run in parallel. The maximum number of jobs is represented by the number of “slots” – one slot can only run one job at a time – and in the screenshots of the popups above you can see each job has a slot number. If there are more jobs that could be run than there are slots available at a given time then some jobs have to wait for a slot.
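The number of slots is tied to the amount of parallelism you allow for the refresh. If you're refreshing from a notebook with Semantic Link, as in the previous post in this series, I believe the max_parallelism parameter of fabric.refresh_dataset maps to the same maxParallelism setting mentioned above; treat this sketch as an assumption and check the sempy documentation for your version:

import sempy.fabric as fabric

# placeholder names; setting max_parallelism to 1 should force all refresh jobs
# to share a single slot and therefore run sequentially
fabric.refresh_dataset(
    "Insert semantic model name here",
    "Insert workspace name here",
    refresh_type="full",
    max_parallelism=1
)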

Here’s another example: a semantic model with three tables, A, B and C, which each take 30 seconds to refresh.

There are no dependencies between the tables so in theory each of these three tables could refresh in parallel. However if you refresh with maxParallelism set to two then only two of the three can refresh at any one time. Here’s the job graph for a refresh that does that:

As you can see the critical path goes through the refresh jobs for table A, and hovering over the “Process Partition A[A]” job shows the following:

While this job was not blocked at all, because there are no jobs it depends on, it had to wait 30 seconds for a slot to become available; it eventually ran in slot 0. Hovering over the nodes for “Process Partition B[B]” and “Process Partition C[C]” shows that they ran in slot 0 and slot 1 respectively and that neither of them was blocked or had to wait.

The job graph isn’t always the best way of visualising this type of parallelism; Phil Seamark’s original Power BI report for visualising refreshes has a page which shows the slots and the jobs in them but I think there’s probably a better way of visualising all of this data that shows slots as well as dependencies. Maybe a Deneb visual is the answer? If anyone has ideas I’d be interested to hear them! In any case, the first step to doing this is to extract all of this data from the .DGML file into a table and that’s what I’ll demonstrate how to do in the next post in this series.

Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns

event_schema = {
    "JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break

            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop Trace and collect logs
        final_trace_logs = trace.stop()



# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the Text Data column of the remaining events and concatenate it to get the contents of the DGML file, and then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
dgmlfile = open("/lakehouse/default/Files/" + request_status_id + ".dgml", 'x')
print (out, file=dgmlfile)
dgmlfile.close()

#dispose of trace connection
trace_connection.disconnect_and_dispose()

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, you can use OneLake Explorer to sync files in OneLake with your PC in a very similar way to OneDrive, which makes working with these DGML files in VS Code very straightforward: you can simply open the local copy when it syncs.

Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]

Incremental Refresh On Delta Tables In Power BI

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However, not everyone is ready for Fabric yet, so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lakehouse and your import mode Power BI semantic models then this new function could allow you to remove it, reducing complexity and cutting costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible with this connector because there’s no database to query: a Delta table is just a folder with some files in it. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table by the datetime column required for incremental refresh, that query is significantly faster than reading all the data from the table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.

There are three factors that influence the performance of Power Query when querying a Delta table:

  1. The internal structure of the Delta table, in particular whether it is partitioned or not
  2. The implementation of the connector, ie the DeltaLake.Table function
  3. The M code you write in the queries used to populate the tables in your semantic model

There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.

Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"

Here’s the incremental refresh dialog:

I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:

The refresh took just over 30 minutes.

Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column added that removes everything apart from the date, and I then partitioned the table by that new column (called PickupDate here):

CREATE TABLE PartitionedByDateNYC
USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw

I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
    )
in
    #"Filtered Rows"

…and refreshed again. This time the refresh took about 26 seconds.

Half an hour to 26 seconds is a big improvement and it’s because the DeltaLake.Table function is able to perform partition elimination: the partitions in the semantic model align to one or more partitions in the Delta table, so when each partition in the semantic model is refreshed Power Query only needs to read data from the partitions in the Delta table that contain the relevant data. This only happens because the filter in the Power Query query using the RangeStart and RangeEnd parameters is on the same column that is used to partition the Delta table.
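If you want to check which column a Delta table is actually partitioned by before setting up incremental refresh, Delta Lake's DESCRIBE DETAIL command shows it. For example, in a Fabric notebook, using the table created above:

# DESCRIBE DETAIL returns one row of metadata for the table, including the
# partitionColumns that the incremental refresh filter needs to match
detail = spark.sql("DESCRIBE DETAIL PartitionedByDateNYC")
display(detail.select("partitionColumns", "numFiles", "sizeInBytes"))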

In my final test I partitioned my Delta table by month, like so:

CREATE TABLE PartitionedNYC
USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw

The challenge here is that:

  1. The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
  2. Power BI incremental refresh creates partitions at the year, quarter, month and date granularities, so filtering by month can’t be used for date partitions

I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart),
    YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd),
    FilterByPartition = Table.StopFolding(
        Table.SelectRows(
            ToDelta,
            each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd
        )
    ),
    #"Filtered Rows" = Table.SelectRows(
        FilterByPartition,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"

Interestingly this table refreshed even faster: it took only 18 seconds.

This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).

To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.

What Does It Mean To Refresh A Direct Lake Power BI Dataset In Fabric?

If you’ve heard about the new Direct Lake mode for Power BI datasets in Fabric you’ll know that it gives you the query performance of Import mode (well, almost) without the need to actually import any data. Direct Lake datasets can be refreshed though – in fact, they refresh automatically by default – and if you look at the dataset’s Refresh History you’ll see there’s a Direct Lake section which sometimes shows errors:

Also, if you look at a custom dataset’s Settings page (although not yet for a default dataset) you’ll see some properties to control refresh too:

So what does it actually mean to refresh a Direct Lake dataset if it doesn’t involve loading data into the dataset?

The section on Direct Lake refresh in the Fabric docs has the following information:

Invoking a refresh for a Direct Lake dataset is a low cost operation where the dataset analyzes the metadata of the latest version of the Delta Lake table and is updated to reference the latest files in the OneLake.

Let’s see what this means using a simple example. I built a Dataflow Gen2 that loads a single row of data into a table in a Fabric Lakehouse with two columns: one called Sales that always contains the value 1 and one called LoadDate that contains the date and time the dataflow ran:

I ran the dataflow once to load a row of data into a table called MyTable in a Lakehouse:

I then built a custom dataset (because I want to change those refresh options mentioned above) consisting of just this table, and finally a report showing the contents of the table in the dataset:

I then connected SQL Server Profiler to the custom dataset via the XMLA Endpoint and started a trace to capture the Command Begin/End and Progress Report Begin/End events, and refreshed the dataflow (and only the dataflow) to load another row of data into the Lakehouse. Soon after the dataflow refresh finished, the Profiler trace showed a dataset refresh started automatically:

Refreshing the report showed the second row that had just been loaded:

This shows that, with the default settings, a Direct Lake dataset is automatically refreshed when data is loaded into a Lakehouse.

I then went to the Settings pane for the dataset and turned off the “Keep your Direct Lake data up to date” property:

I then ran the dataflow again and this time the Profiler trace showed that no automatic refresh took place; the new row was not shown in the report either. Manually refreshing the dataset from the workspace did result in the new row appearing in the report:

Next, I used a Notebook to delete all the rows from the table:

At this point the report still showed the three rows displayed in the previous screenshot. Finally, I refreshed the dataset one more time and all the data disappeared from the report:

It’s important to stress that the reason that the refresh is needed to show the latest data in the table is not because the data is being loaded into the dataset. It’s because, as the docs say, refresh tells the dataset to look at the latest version of the data in the table – which leads on to the whole topic of time travel in Delta tables in Fabric. Dennes Torres has a nice blog post on this subject here which is a great place to start.
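If you want to see the table versions that a Direct Lake refresh is choosing between, Delta Lake's DESCRIBE HISTORY command lists them. Here's a quick example you could run in a Fabric notebook against the table used above:

# each row is a version of the Delta table; a Direct Lake refresh simply
# points the dataset at the latest version rather than loading any data
history = spark.sql("DESCRIBE HISTORY MyTable")
display(history.select("version", "timestamp", "operation"))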

Why would you ever want to refresh a Direct Lake dataset manually? Again, the docs have the answer:

You may want to disable [refresh] if, for example, you need to allow completion of data preparation jobs before exposing any new data to consumers of the dataset. 

So, let’s say you need to load some new data to your table and also delete or update some data that’s already there and this needs to be done as several separate jobs. It’s very likely that you don’t want your Power BI reports to show any of the new data until all of these jobs have been completed, so to do this you will want to disable automatic refresh and do a manual dataset refresh as the last step of your ETL.
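Here's a sketch of what that last step might look like in a notebook with Semantic Link, assuming fabric.refresh_dataset behaves the same way for Direct Lake datasets as it does for Import mode ones, and using placeholder names:

import sempy.fabric as fabric

# ...all data preparation jobs run before this point...

# final ETL step: refresh (i.e. reframe) the Direct Lake dataset so that
# reports only see the new data once every job has completed
fabric.refresh_dataset(
    "Insert dataset name here",
    "Insert workspace name here"
)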

Exploring Power BI Import Mode Throughput With Log Analytics, Part 3: Throughput Variations During A Refresh

In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.

Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportCurrent" or OperationName=="ProgressReportBegin" or OperationName == "ProgressReportEnd"
| where XmlaObjectPath!=""
| where OperationDetailName=="ReadData"
| project OperationName, XmlaObjectPath, EventText, TimeGenerated, ProgressCounter, OperationDetailName, XmlaRequestId, ObjectPathRequestId = strcat(XmlaObjectPath ,XmlaRequestId)
| partition by ObjectPathRequestId
(
    sort by TimeGenerated asc
    | scan declare( PrevEventTimeGenerated:datetime, TimeSincePrevEvent:timespan=time(0s), TimeSinceRefreshBegin:timespan=time(0s), NumberOfRowsRead:int=0) with 
    (
        step RefreshStarts: OperationName=="ProgressReportBegin";
        step RefreshInProgress: OperationName == "ProgressReportCurrent" =>
            PrevEventTimeGenerated = iff(ProgressCounter==10000, RefreshStarts.TimeGenerated, RefreshInProgress.TimeGenerated),
            TimeSincePrevEvent = iff(ProgressCounter==10000, TimeGenerated - RefreshStarts.TimeGenerated, TimeGenerated-RefreshInProgress.TimeGenerated),
            TimeSinceRefreshBegin = TimeGenerated - RefreshStarts.TimeGenerated,
            NumberOfRowsRead = iff(ProgressCounter==10000, 10000, ProgressCounter-RefreshInProgress.ProgressCounter);
        step RefreshEnds: OperationName == "ProgressReportEnd" => 
            PrevEventTimeGenerated = RefreshInProgress.TimeGenerated,
            TimeSincePrevEvent = TimeGenerated - RefreshInProgress.TimeGenerated,
            TimeSinceRefreshBegin = TimeGenerated - RefreshStarts.TimeGenerated,
            NumberOfRowsRead = ProgressCounter - RefreshInProgress.ProgressCounter;
    )
    | where OperationName == "ProgressReportCurrent" or OperationName == "ProgressReportEnd"
)
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0], OperationName, 
StartTime = format_datetime(PrevEventTimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), EndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ProgressCounter, SecondsSincePreviousEvent = TimeSincePrevEvent/1s, SecondsSinceRefreshBegin=TimeSinceRefreshBegin/1s,NumberOfRowsRead, 
NumberOfRowsReadPerSecond=NumberOfRowsRead/(TimeSincePrevEvent/1s), XmlaRequestId

Here’s a brief explanation of what it does:

  • It filters the Log Analytics data down to get events from the last day and just the ProgressReportCurrent events, as well as the ProgressReportBegin/End events which are fired before and after the ProgressReportCurrent events.
  • It then splits the data into groups of rows (‘partitions’ in KQL, but of course not the partitions that are being refreshed) by a combination of XmlaRequestId (ie the refresh operation) and XmlaObjectPath (ie the partition that is being refreshed)
  • For each group of rows it will then:
    • Find the ProgressReportBegin event and from this get the time when data started to be read from the source
    • Get all subsequent ProgressReportCurrent events and calculate the amount of time elapsed since the previous event (which might be the ProgressReportBegin event or a previous ProgressReportCurrent event) and the number of rows read
    • When the ProgressReportEnd event is encountered, calculate the amount of time elapsed since the previous ProgressReportCurrent event and the number of rows (which will be less than 10000) read since then
  • It filters out the ProgressReportBegin events because they are no longer needed
  • Finally, it adds columns that split out the table name and partition name and calculate the number of rows read per second for each row, by dividing the number of rows read for each event by the amount of time elapsed since the previous event

What can this query tell us about throughput?

First of all, something interesting but not necessarily useful. At least for the data source I’m using for my tests, when I plotted a column chart with the number of rows read on the x axis and the amount of time elapsed since the last event on the y axis (ie the amount of time it takes to read 10000 rows, for all but the last column) I noticed that every 200000 rows something happens to slow down the read:

I have no idea what this is, or whether it’s a quirk of this particular source or connector, but it’s a great example of the kind of pattern that becomes obvious when you visualise data rather than looking at a table of numbers.

Plotting time on the x axis of a line chart and the cumulative total of rows read on the y axis gives you something more useful. Here’s the chart for one of the refreshes mentioned in my last post where four partitions of the same table are refreshed in parallel:

In this case throughput is fine up until the end of the refresh, at which point something happens to the February, March and April partitions (but not the January partition) that slows them down for about 30 seconds, after which throughput goes back to what it was before. Here’s the same chart zoomed in a bit:

Here’s the same problem shown in the first graph above, where the number of rows read is on the x axis, showing how for example with the April partition there’s a sudden spike where it takes 14 seconds to read 10000 rows rather than around 0.3 seconds:

What is this, and why isn’t the January partition affected? Maybe it was a network issue or caused by something happening in the source database? Looking at another refresh that also refreshes the same four partitions in parallel, it doesn’t seem like the same thing happens – although if you look closely at the middle of the refresh there might be a less pronounced flattening off:

Again, the point of all this is not the mysterious blips I’ve found in my data but the fact that if you take the same query and look at your refreshes, you may find something different, something more significant and something you can explain and do something about.
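If you want to try building these charts yourself, one option is to export the results of the query above (or load them with a Log Analytics connector) and plot them in a notebook. Here's a rough sketch, assuming the results have been saved to a CSV file with the column names produced by the query:

import pandas as pd
import matplotlib.pyplot as plt

# assumes the KQL query results above have been exported to a CSV file
df = pd.read_csv("ProgressReportCurrent.csv", parse_dates=["EndTime"])

# cumulative rows read over time, one line per partition
for partition, rows in df.groupby("Partition"):
    rows = rows.sort_values("EndTime")
    plt.plot(rows["EndTime"], rows["ProgressCounter"], label=str(partition))

plt.xlabel("Time")
plt.ylabel("Cumulative rows read")
plt.legend()
plt.show()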

Exploring Power BI Import Mode Throughput With Log Analytics, Part 2: High-Level Statistics

In the first post in this series I described the events in Log Analytics that can be used to understand throughput – the speed at which Power BI can read data from your data source during refresh. While the individual events are easy to understand when you look at a simple example, they don’t make it easy to analyse the data in the real world, so here’s a KQL query that takes all the data from all these events and gives you one row per partition per refresh:

//Headline stats for partition refresh with one row for each partition and refresh
//Get all the data needed for this query and buffer it in memory
let RowsForStats = 
materialize(
  PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" 
or (OperationDetailName == "TabularRefresh" and (EventText contains "partition"))  
);
//Get just the events for the initial SQL execution phase
let ExecuteSql = 
RowsForStats
| where OperationDetailName == "ExecuteSql"
| project XmlaRequestId, XmlaObjectPath, 
ExecuteSqlStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
ExecuteSqlEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ExecuteSqlDurationMs = DurationMs, ExecuteSqlCpuTimeMs = CpuTimeMs;
//Get just the events for the data read and calculate rows read per second
let ReadData =
RowsForStats
| where OperationDetailName == "ReadData"
| project XmlaRequestId, XmlaObjectPath, 
ReadDataStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
ReadDataEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ReadDataDurationMs = DurationMs, ReadDataCpuTime = CpuTimeMs, 
TotalRowsRead = ProgressCounter, RowsPerSecond = ProgressCounter /(toreal(DurationMs)/1000);
//Get the events for the overall partition refresh
let TabularRefresh =
RowsForStats
| where OperationDetailName == "TabularRefresh"
| parse EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, XmlaObjectPath, 
TabularRefreshStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
TabularRefreshEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
TabularRefreshDurationMs = DurationMs, TabularRefreshCpuTime = CpuTimeMs, 
MashupCPUTimeMs, MashupPeakMemoryKB;
//Do an inner join on the three tables so there is one row per partition per refresh
ExecuteSql
| join kind=inner ReadData on XmlaRequestId, XmlaObjectPath
| join kind=inner TabularRefresh on XmlaRequestId, XmlaObjectPath
| project-away XmlaRequestId1, XmlaRequestId2, XmlaObjectPath1, XmlaObjectPath2
| extend Table = tostring(split(XmlaObjectPath,".", 2)[0]), Partition = tostring(split(XmlaObjectPath,".", 3)[0])
| project-reorder XmlaRequestId, Table, Partition
| order by XmlaRequestId, ExecuteSqlStartTime desc

It’s a bit of a monster query but what it does is quite simple:

  • First it gets all the events relating to partition refresh in the past 1 day (which of course you can change) and materialises the results.
  • Then it filters this materialised result and gets three sets of tables:
    • All the ExecuteSql events, which tell you how long the data source took to start returning data and how much CPU time was used.
    • All the ReadData events, which tell you how long Power BI took to read all the rows from the source after the data started to be returned, how much CPU time was used, and how many rows were read. Dividing duration by rows read lets you calculate the number of rows read per second during this phase.
    • All the TabularRefresh events, which give you overall data on how long the partition refresh took, how much CPU time was used, plus information on Power Query peak memory usage and CPU usage.

What can this tell us about refresh throughput though? Let’s use it to answer some questions we might have about throughput.

What is the impact of parallelism on throughput? I created a dataset on top of the NYC taxi data Trip table with a single table, and in that table created four partitions containing data for January, February, March and April 2013, each of which contained 13-15 million rows. I won’t mention the type of data source I used because I think it distracts from what I want to talk about here, which is the methodology rather than the performance characteristics of a particular source.

I then ran two refreshes of these four partitions: one which refreshed them all in parallel and one which refreshed them sequentially, using custom TMSL refresh commands and the maxParallelism property as described here. I did a refresh of type dataOnly, rather than a full refresh, in the hope that it would reduce the number of things happening in the Vertipaq engine during refresh that might skew my results. Next, I used the query above as the source for a table in Power BI (for details on how to use Log Analytics as a source for Power BI see this post; I found it more convenient to import data rather than use DirectQuery mode though) to visualise the results.

Comparing the amount of time taken for the SQL query used to start to return data (the ExecuteSqlDurationMs column from the query above) for the four partitions for the two refreshes showed the following:

The times for the four partitions vary a lot for the sequential refresh but are very similar for the parallel refresh; the January partition, which was refreshed first, is slower in both cases. The behaviour I described here regarding the first partition refreshed in a batch could be relevant.

Moving on to the Read Data phase, looking at the number of rows read per second (the RowsPerSecond column from the query above) shows a similar pattern:

There’s a lot more variation in the sequential refresh. Also, as you would expect, the number of rows read per second is much higher when partitions are refreshed sequentially compared to when they are refreshed in parallel.

Looking at the third main metric, the overall amount of time taken to refresh each partition (the TabularRefreshDurationMs column from the query above) again shows no surprises:

Each individual partition refreshes a lot faster in the sequential refresh – almost twice as fast – compared to the parallel refresh. Since four partitions are being refreshed in parallel during the second refresh, though, any loss of throughput for an individual partition as a result of refreshing in parallel is more than compensated for by the parallelism, making the parallel refresh faster overall. This can be shown by plotting the TabularRefreshStartTime and TabularRefreshEndTime columns from the query above on a timeline chart (in this case the Craydec Timelines custom visual) for each refresh and each partition:

On the left of the timeline you can see the first refresh where the partitions are refreshed sequentially, and how the overall duration is just over 20 minutes; on the right you can see the second refresh where the partitions are refreshed in parallel, which takes just under 10 minutes. Remember also that this is just looking at the partition refresh times, not the overall time taken for the refresh operation for all partitions, and it’s only a refresh of type dataOnly rather than a full refresh.

So does this mean more parallelism is better? That’s not what I’ve been trying to say here: more parallelism is better for overall throughput in this test but if you keep on increasing the amount of parallelism you’re likely to reach a point where it makes throughput and performance worse. The message is that you need to test to see what the optimal level of parallelism – or any other factor you can control – is for achieving maximum throughput during refresh.

These tests only show throughput at the level of the ReadData event for a single partition, but as mentioned in my previous post there is even more detailed data available with the ProgressReportCurrent event. In my next post I’ll take a closer look at that data.

[Thanks to Akshai Mirchandani for providing some of the information in this post, and hat-tip to my colleague Phil Seamark who has already done some amazing work in this area]

Exploring Power BI Import Mode Throughput With Log Analytics, Part 1: The Events

If you’re tuning refresh for a Power BI Import mode dataset one of the areas you’ll be most interested in is throughput, that is to say how quickly Power BI can read data from the data source. It can be affected by a number of different factors: how quickly the data source can return data; network latency; the efficiency of the connector you’re using; any transformations in Power Query; the number of columns in the data and their data types; the number of other objects in the same Power BI dataset being refreshed in parallel; and so on. How do you know if any or all of these factors are a problem for you? It’s a subject that has always interested me and now that Log Analytics for Power BI datasets is GA we have a powerful tool to analyse the data, so I thought I’d do some testing and write up my findings in a series of blog posts.

The first thing to understand is what events there are in Log Analytics that provide data on throughput. The events to look at are ProgressReportBegin, ProgressReportCurrent and ProgressReportEnd (found in the OperationName column), specifically those with an OperationDetailName of ExecuteSql, ReadData or TabularRefresh. Consider the following KQL query that looks at this data for a single refresh and a single partition:

let RefreshId = "e5edc0de-f223-4c78-8e2d-01f24b13ccdc";
let PartitionObjectPath = "28fc7514-d202-4969-922a-ec86f98a7ea2.Model.TIME.TIME-ffca8cb8-1570-4f62-8f04-993c1d1d17cb";
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(3d)
| where XmlaRequestId == RefreshId
| where XmlaObjectPath == PartitionObjectPath
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" or OperationDetailName == "TabularRefresh"
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0], 
TimeGenerated, OperationName, OperationDetailName, EventText, DurationMs, CpuTimeMs, ProgressCounter
| order by XmlaObjectPath, TimeGenerated asc;

Some notes on what this query does:

  • All the data comes from the PowerBIDatasetsWorkspace table in Log Analytics
  • I’ve put an arbitrary filter on the query to only look for data in the last three days, which I know contains the data for the specific refresh I’m interested in
  • An individual refresh operation can be identified by the value in the XmlaRequestId column and I’m filtering by that
  • An individual partition being refreshed can be identified by the value in the XmlaObjectPath column and I’m filtering by that too
  • The value in the XmlaObjectPath column can be parsed to obtain both the table name and partition name
  • The query filters the events down to those mentioned above

Here are some of the columns from the output of this query, showing data for the refresh of a single table called TIME with a single partition:

Some more notes on what we can see here:

  • The TimeGenerated column gives the time of the event. Although the time format shown here only shows seconds, it actually contains the time of the event down to the millisecond level – unlike in a Profiler trace, where time values are rounded to the nearest second and are therefore a lot less useful.
  • The first event returned by this query is a ProgressReportBegin event of type TabularRefresh which marks the beginning of the partition refresh.
  • As I blogged here, after that are a pair of ProgressReportBegin/End events of type ExecuteSql. The value in the DurationMs column tells you how long it took for data to start to be returned (this includes the time taken by the actual query generated against the data source and by any Power Query transformations) – which was, in this case, 6016 milliseconds or 6 seconds.
  • Next there is a ProgressReportBegin event which indicates the beginning of data being read from the source.
  • After that there are a series of ProgressReportCurrent events which mark the reading of chunks of 10000 rows from the source. The ProgressCounter column shows the cumulative number of rows read.
  • Next there is a ProgressReportEnd event that marks the end of the data read. The ProgressCounter column shows the total number of rows read (which must be less than 10000 rows greater than the value in the ProgressCounter column for the previous ProgressReportCurrent event); the DurationMs column shows the total time taken to read the data. In this case 86430 rows of data were read in 1.4 seconds (see the quick calculation after this list).
  • Finally there is a ProgressReportEnd event of type TabularRefresh, the pair of the first event returned. Not only does it show, in the DurationMs column, the total amount of time taken and the CPU time used to refresh the partition (which includes other operations that are nothing to do with throughput, such as compressing data and building dictionaries); as I blogged here, it also includes data on the CPU and peak memory used by Power Query during the refresh. In this case the total refresh time was 7.6 seconds, so not much more than the time taken to read the data.
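As an aside, the rows-per-second figure used in the other posts in this series is just the number of rows read divided by the ReadData duration; for the numbers above:

# throughput for the ReadData phase in the example above
total_rows_read = 86430
read_duration_seconds = 1.4
print(total_rows_read / read_duration_seconds)  # roughly 61,700 rows per second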

Clearly there’s a lot of useful, interesting data here, but how can we analyse it or visualise it? Stay tuned for my next post…

Monitoring Power BI Dataset Refresh Memory And CPU Usage With Log Analytics

Maybe the fourth- or fifth-most exciting Power BI-related announcement last month (admittedly it was an exciting month) was that Log Analytics for Power BI datasets is now GA and that you can link multiple Power BI workspaces to a single Log Analytics workspace. This, for me, means that enabling Log Analytics has gone from being useful to essential for anyone interested in monitoring Analysis Services engine activity in an enterprise Power BI/Fabric deployment. It works with Direct Lake datasets too!

Anyway, following on from my recent posts on measuring memory and CPU usage during dataset refresh, both for the refresh as a whole and for individual partitions, here are some KQL queries that allow you to extract that information from Log Analytics.

Here’s the KQL query to get the memory and CPU usage for a refresh (see this post for details):

PowerBIDatasetsWorkspace
| where OperationName == "CommandEnd"
| parse-where EventText with * '[PeakMemory: ' PeakMemoryKB:long ' KB, MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| extend DatasetName=ArtifactName
| project TimeGenerated, CpuTimeMs, DurationMs, MashupCPUTimeMs, MashupPeakMemoryKB, PeakMemoryKB, XmlaRequestId, DatasetName, PowerBIWorkspaceName

And here’s the KQL query to get the memory and CPU usage for individual partitions/Power Query queries (see this post for details):

PowerBIDatasetsWorkspace
| where OperationName == "ProgressReportEnd"
| parse-where EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| extend Table=split(XmlaObjectPath,'.').[2], Partition=split(XmlaObjectPath,'.').[3]
| extend DatasetName=ArtifactName
| project TimeGenerated, Table, Partition, CpuTimeMs, DurationMs, MashupCPUTimeMs, MashupPeakMemoryKB, XmlaRequestId, DatasetName, PowerBIWorkspaceName

You could use these queries in a Power BI DirectQuery dataset to report on all your refreshes, similar to what I did in this series of posts last year. The XmlaRequestId column contains the unique identifier for the refresh operation, so you could build a one-to-many relationship between the table with the refresh-level data and the table with the partition-level data. Maybe I’ll get round to building that report sometime…
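If you just want a quick ad-hoc look at this data in a notebook rather than a full report, you could also join the two result sets on XmlaRequestId with pandas. A sketch, assuming you've saved the output of the two queries above as CSV files:

import pandas as pd

# assumes the two KQL query results above have been exported as CSV files
refreshes = pd.read_csv("RefreshLevel.csv")      # one row per refresh (CommandEnd)
partitions = pd.read_csv("PartitionLevel.csv")   # one row per partition (ProgressReportEnd)

# XmlaRequestId identifies the refresh operation, so it's the join key between
# the refresh-level and partition-level rows
joined = partitions.merge(
    refreshes[["XmlaRequestId", "PeakMemoryKB"]],
    on="XmlaRequestId",
    how="left"
)
print(joined.head())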

[Thanks to Jast Lu, who wrote the original version of these queries]

Power BI Automatic Refresh Retries

One of the most confusing things about troubleshooting Power BI refresh problems is the way Power BI will sometimes try running a refresh again after it has failed. It means that refreshes seem to take a lot longer than you would expect and also that more queries are run against your data source than you would expect. I don’t have all the answers but I thought it would be useful to highlight a few scenarios where it does happen.

A while ago I wrote a series of posts on dataset refresh timeouts which started out with this post on the fact that no refresh in the Power BI portal can last more than 2 hours (in Shared/Pro capacity) or 5 hours (in Premium), as documented here. While writing that post I built a dataset that takes more than 5 hours to refresh; back then, and testing it again now, I found that it seemed to take a lot longer than 5 hours to fail.

What happened? I kicked off the refresh at 07:33 UTC one morning. With Log Analytics enabled on the workspace I could run the following KQL query to see what happened five hours later:

PowerBIDatasetsWorkspace
| where TimeGenerated>datetime(2023-05-05 12:32:00) and TimeGenerated<datetime(2023-05-05 12:35:00)
| where OperationName in ('CommandBegin', 'CommandEnd', 'Error', 'ProgressReportBegin', 'ProgressReportEnd', 'ProgressReportError')
| where OperationDetailName<>'ExtAuth'
| project TimeGenerated, OperationName, OperationDetailName, DurationMs, CpuTimeMs, EventText, Status, StatusCode
| order by TimeGenerated asc

Sure enough, around 12:34 UTC I could see errors happening for the existing refresh, then (on line 22 in the screenshot below) a CommandEnd event corresponding to the refresh that failed and which had run for five hours. After that (on line 25 in the screenshot below) there was a CommandBegin event for a new refresh on the same dataset:

Interestingly, even though I could see from Log Analytics that the refresh had restarted and I could also see the refresh “spinny” spinning in the workspace, the Refresh History dialog was already showing a timeout:

[I’m told that we have plans to improve this and provide more details on what’s happening in the UI in the future]

The same thing happened again with the next refresh failing at 17:34 UTC that day, five hours later, again at 23:36 UTC, again at 03:46 UTC the next day; there was then a delay of one hour when the refresh started again, at 04:35 UTC and the final failure came one hour after that at 05:37 UTC. Lots of failures and lots of retries.

I happened to capture a similar refresh retry in a Profiler trace recently. I scheduled a refresh for 23:00 UTC on a busy Premium capacity and saw the following:

This trace shows:

  • No activity before 23:05, which I think shows how busy the capacity was at the time
  • The first attempted refresh kicked off at 23:05:10 (the first highlighted pair of Command Begin/End events in the screenshot) and failed immediately.
  • The error message associated with this failure was “You’ve exceeded the capacity limit for dataset refreshes. Try again when fewer datasets are being processed”, an error documented here. This was caused by reaching the limit for the number of concurrent datasets that can be refreshed on a single capacity (as documented in the Model Refresh Parallelism column of the table here).
  • The dataset refresh kicked off again at 23:06:10 and finished successfully after 42 seconds (the second highlighted pair of Command Begin/End events in the screenshot above).

So two scenarios where Power BI retried a refresh after an initial failure. As I said, though, not all refresh failures result in retries. For example, when I tried refreshing the dataset in this post which fails with a timeout on the SQL Server side, I could see that Power BI only ran the refresh command once:

I don’t know when the Power BI Service does and doesn’t retry a refresh after a failure but I’m trying to find out; I’ll blog again if I do get more information.

While you don’t have any control over retry behaviour if you schedule a refresh in the Power BI Service or kick off a manual refresh from the portal, you do get more control if you refresh using the Enhanced Refresh API and set the retryCount parameter. Indeed the docs for the Enhanced Refresh API have some extra detail about how retryCount and the five hour timeout interact here, mentioning that even if you keep retrying a refresh it will always fail after 24 hours and that the total time for a refresh includes the time taken for all failures before a success. This, for me, is one more reason to use the Enhanced Refresh API for refreshing enterprise-level datasets.