First Look At Fabric Graph: Analysing Power BI Import Mode Refresh Job Graphs

The new Fabric Graph database is now rolling out and should be available to everyone within the next few weeks if you can’t see it already. The key to learning a new data-related technology is, I think, to have some sample data that you’re interested in analysing. But if you’re a Power BI person why would a graph database be useful or interesting? Actually I can think of two scenarios: analysing dependencies between DAX calculations and the tables and columns they reference using the data returned by the INFO.CALCDEPENDENCY function (see here for more details on what this function does); and the subject of this blog post, namely analysing Import mode refresh job graphs.

I’m sure even some of the most experienced Power BI developers reading this are now wondering what an Import mode refresh job graph is, so let me remind you of a series of three posts I wrote early in 2024 on extracting the job graph events from a refresh using Semantic Link Labs and visualising them, understanding the concepts of blocking and waiting in an Import mode refresh, and, crucially for this post, how to save job graph information to a table in OneLake. Here’s a quick explanation of what a refresh job graph is, though, using the model from the second post. Let’s say you have an Import mode semantic model consisting of the following three tables:

X and Y are tables that contain a single numeric column. XYUnion is a calculated table which unions the tables X and Y and has the following DAX definition:

XYUnion = UNION(X,Y)

If you refresh this semantic model the following happens:

  1. The Power BI engine creates a job to refresh the semantic model
  2. This in turn kicks off two jobs to refresh the tables X and Y
  3. Refreshing table X kicks off jobs to refresh the partitions in table X and the attribute hierarchies in table X
  4. Refreshing table Y kicks off jobs to refresh the partitions in table Y and the attribute hierarchies in table Y
  5. Once both table X and Y have been refreshed the calculated table XYUnion can be refreshed, which in turn kicks off jobs to refresh the attribute hierarchies in table XYUnion
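
To make the dependency structure clearer, here’s the same chain of jobs sketched as a simple tree (simplified – the job names you’ll see in an actual trace are slightly different):

Refresh semantic model
├── Refresh table X
│   ├── Refresh partitions in X
│   └── Refresh attribute hierarchies in X
├── Refresh table Y
│   ├── Refresh partitions in Y
│   └── Refresh attribute hierarchies in Y
└── Refresh calculated table XYUnion (can only start once X and Y have finished)
    └── Refresh attribute hierarchies in XYUnion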

So you can see that refreshing an Import mode model results in the creation of refresh jobs for individual objects which have a complex chain of dependencies between them. If you want to tune an Import mode model refresh then understanding this chain of dependencies can be really useful. Running a Profiler trace while a refresh is happening and capturing the Job Graph trace events gives you all the data needed to do this.

I refreshed the model shown above and saved the Job Graph data for it to two tables in OneLake using the code in this post. The first table was called RefreshJobs and contained one row for each job created during the refresh:

The second table contained all the dependencies between the jobs and was called Links:

I then created a new Graph model in my workspace, clicked Get Data, selected the lakehouse where the two tables above were stored, selected those two tables:

…and clicked Load. Then in the model editor I clicked Add Node and created a node called RefreshJobs from the RefreshJobs table:

And then I clicked Add Edge and created an edge called DependsOn from the Links table:

Next I clicked Save to load all the data into the graph model (this is very similar to refreshing a semantic model).

This resulted in a simple graph model which represented the recursive relationship between the jobs in a refresh:

I was then able to use the query builder to create a diagram showing all the dependencies between the jobs:

It’s not as pretty as the tools for viewing DGML files I showed in this post last year but it has the advantage of allowing you to filter on the properties of nodes and edges. If you know GQL (and I wrote my first GQL query all of two hours ago…) you can write queries to do much more advanced types of analysis. Here’s a GQL query I managed to write which returns all the jobs that depend on the job with the JobId 1, and all the jobs that depend on those jobs:

MATCH 
(source_Job:Job)-[DependsOn:DependsOn]->{1,2}(target_Job:Job)
WHERE target_Job.jobId=1
RETURN 
target_Job.jobId,
target_Job.description,
source_Job.jobId,
source_Job.description
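
The {1,2} quantifier on the edge pattern is what limits the results to dependency chains of one or two hops away from the job with JobId 1; increase the upper bound and the query will follow longer chains of dependencies.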

This is really cool stuff and in particular I would love to learn a bit more GQL to understand how the dependencies between objects and the amount of parallelism possible during a refresh affect refresh performance. If I get the time to do so I’ll write more blog posts!

Refreshing Hidden Tables In Power BI Semantic Models With Fabric Data Pipelines

Following on from my recent post about refreshing semantic models with Fabric Data Pipelines and the semantic model refresh activity, a few people asked me how to refresh hidden tables because they are not displayed in the Pipeline configuration UI. I got the answer from my colleague Alex Powers (aka reddit celebrity u/itsnotaboutthecell) who kindly allowed me to blog about it.

To demonstrate how to do this, I created a semantic model with two tables: one visible, called VisibleTable, and one hidden, called HiddenTable.

I then published the semantic model, created a Data Pipeline and added a semantic model refresh activity; selected the connection, workspace and semantic model; waited for the Table(s) dropdown to populate (yes I know it’s slow, we’re working on it):

…and then, when it loaded, noted that only the visible table was shown in the dropdown:

I didn’t select anything and instead clicked “Add dynamic content” to use an expression to select the table instead:

Then in the Pipeline expression builder I entered the following:

@json('
    [
        {
            "table":"HiddenTable"
        }
    ]
')

Having done this I ran the Pipeline and just the hidden table was refreshed. Easy!

The expression needs to be a JSON array of table and partition names. Here’s an example showing how to refresh the table called HiddenTable and the sole partition of the table called VisibleTable (which also happens to be called VisibleTable) in the same refresh:

@json('
    [
        {
            "table":"HiddenTable"
        },
        {
            "table": "VisibleTable",
            "partition": "VisibleTable"
        }
    ]
')

It’s useful to know how to construct the expression even if you don’t need to refresh hidden tables – for example, you might want to dynamically generate the list of tables or partitions to refresh with an expression.
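
For example – a minimal sketch, assuming you’ve added a string parameter to the Pipeline called TableToRefresh (a made-up name) – an expression like this builds the same kind of JSON array from that parameter’s value at runtime:

@json(
    concat(
        '[{"table":"',
        pipeline().parameters.TableToRefresh,
        '"}]'
    )
)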

Another Way To Visualise Import Mode Semantic Model Partition Refresh Parallelism

I’m a big fan of Phil Seamark’s “visualise your refresh” pbix file, which allows you to analyse the different operations of an Import mode semantic model refresh operation (similar functionality has now been built into Semantic Link Labs too). In particular I like how it allows you to visualise operations inside a refresh as a Gantt chart, which makes it easy to see how much parallelism you’re getting when you refresh – and as I discussed here, the amount of parallelism can have a massive effect on how long a refresh takes. However Gantt charts are just one way to visualise this data and after one of my colleagues showed me the KQL range() function I realised there was a different way to approach this problem using KQL and Fabric Workspace Monitoring.

Here’s the KQL query I came up with to find the number of partitions that are refreshing in parallel at any given point in time when a semantic model is refreshed:

SemanticModelLogs
//Filter by Timestamp for performance
| where Timestamp > ago(1h)
//Filter by the OperationId of the refresh
| where OperationId == "insert OperationId of refresh here"
//Only get the events fired when partition refresh finishes
| where OperationDetailName =="Process" and OperationName =="ProgressReportEnd"
//Find the start time of the partition refresh
| extend StartTime = datetime_add("Millisecond", -1 * DurationMs, Timestamp)
//Create a list of all the seconds where the refresh was in progress
| extend StartSecond = range(bin(StartTime,1s), bin(Timestamp,1s), 1s)
| mv-expand StartSecond
//Aggregate the data and get the number of partitions being refreshed in each one-second bin
| summarize PartitionsBeingRefreshed=count() by todatetime(StartSecond)
//Sort in ascending order by one-second bins
| order by StartSecond asc

What this query does is:

  • Filters the events in the SemanticModelLogs table to just those for the refresh whose OperationId you enter
  • Filters again so you only get the ProgressReportEnd events that are fired when a partition refresh finishes
  • Calculates the start time of each partition refresh by taking its finish time and subtracting its duration
  • Creates a table with one row for each second that each partition refresh was in progress
  • Aggregates this table so you get the number of partition refresh operations that were in progress in each one-second bin (the pandas sketch after this list shows the same logic)
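
If it helps to see the same logic outside KQL, here’s a rough pandas equivalent of the expand-and-count approach, assuming you’ve exported just the Timestamp and DurationMs columns of the ProgressReportEnd events to a DataFrame (the two rows here are made-up sample data):

import pandas as pd

# one row per ProgressReportEnd event: the partition refresh end time and its duration
df = pd.DataFrame({
    "Timestamp": pd.to_datetime(["2025-01-01 10:00:30", "2025-01-01 10:00:35"]),
    "DurationMs": [30000, 35000],
})

# calculate the start time of each partition refresh from its end time and duration
df["StartTime"] = df["Timestamp"] - pd.to_timedelta(df["DurationMs"], unit="ms")

# expand each refresh into one row per second it was in progress (the range()/mv-expand step)
df["StartSecond"] = [
    list(pd.date_range(s.floor("s"), e.floor("s"), freq="s"))
    for s, e in zip(df["StartTime"], df["Timestamp"])
]
exploded = df.explode("StartSecond")

# count how many partition refreshes were in progress in each one-second bin
parallelism = exploded.groupby("StartSecond").size().rename("PartitionsBeingRefreshed")
print(parallelism)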

All events associated with a single semantic model refresh can be identified by the value in the OperationId column in the SemanticModelLogs table in Workspace Monitoring. To find it, just run a KQL query on the SemanticModelLogs table to get all the events from the time period when your refresh took place; it will be easy to spot the events associated with the refresh by looking at the contents of the EventText column, so all you need to do is copy the value from the OperationId column from one of these events.
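
For example, a simple query along these lines returns the most recent events with the columns you need to identify the refresh and copy its OperationId (adjust the time filter to cover the period when your refresh ran):

SemanticModelLogs
| where Timestamp > ago(1h)
| project Timestamp, OperationName, OperationDetailName, EventText, OperationId
| order by Timestamp desc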

I refreshed a semantic model with six dimension tables and one fact table containing ten partitions and here’s what the query above showed when visualised as a line chart in a KQL Queryset:

As you can see, at first four partitions are refreshing in parallel; this quickly drops and then rises, and for the first half of the refresh six partitions are refreshing in parallel. At about the two-thirds point this drops to four partitions refreshing in parallel and as they complete the line chart drops to one. Since six partitions refreshing in parallel is the maximum I allowed for this refresh you can see it’s reasonably efficient.

To get more detail on what is being refreshed, here’s a variation of the KQL query above that doesn’t perform the final aggregation:

SemanticModelLogs
| where Timestamp > ago(1h)
| where OperationId == "54d42645-9672-409a-844c-42403526b596"
| where OperationDetailName =="Process" and OperationName =="ProgressReportEnd"
| extend StartTime = datetime_add("Millisecond", -1 * DurationMs, Timestamp)
| extend StartSecond = range(bin(StartTime,1s), bin(Timestamp,1s), 1s)
| mv-expand StartSecond
| project StartSecond = todatetime(StartSecond), Partition = replace_strings(substring(EventText, 24), dynamic(["<oii>", "</oii>"]),dynamic(["",""]))

This data can then be used in a Power BI report and visualised as a stacked bar chart:

The advantage of this is that you can see which partitions are refreshing when: in this case you can see that it was the partitions from the dimension tables that refreshed first, followed by the partitions from the fact table.

If you’re not using Fabric Workspace Monitoring it should be easy to adapt the queries above to work with Power BI Log Analytics integration instead – it’s just a matter of changing a few column names.

There’s one problem I can think of with this approach: if there are multiple small partitions being refreshed that take under a second there’s a chance that you’ll see inflated parallelism numbers in the results at some points. I don’t think that’s a major problem though and overall I think this is quite a useful way to understand how much parallelism you’re getting during a refresh.

[Thanks to Matthew Farrow for the inspiration here – check out his excellent, detailed series of articles on LinkedIn starting here about understanding Fabric capacities and billing]

Monthly Power BI Semantic Model Refreshes With Fabric Data Pipelines

I’m sure you already know how to configure scheduled refresh for your semantic models in Power BI. While the options you have for controlling when refresh takes place are generally good enough – you can configure daily or weekly refreshes and set up to eight times a day for refreshes to take place – there are some scenarios it doesn’t work for, such as monthly refreshes. Up to now the workaround has been to use Power Automate to trigger refreshes (see here for an example) or to call the refresh API from another application. Now, with Fabric, you have a much better option for scheduling refreshes: Data Pipelines.

The semantic model refresh activity in Fabric Data Pipelines was released last year and at the time all the focus was on the extra control it gives you over what happens in a refresh: it allows you to refresh individual tables or partitions and control the amount of parallelism, for example; it also allows you to schedule your refresh after other ETL operations, which in Fabric will probably also be scheduled using Pipelines, have completed. What I want to draw your attention to is the fact that Fabric Data Pipelines use the new Fabric scheduler which offers more flexibility for controlling when they run.

There’s some documentation here on how to schedule a Data Pipeline run but it’s very straightforward to use. I created a Fabric Pipeline consisting of a single semantic model refresh activity like so:

…then hit the Schedule button on the toolbar, selected “Monthly” on the Repeat dropdown and configured it to run on the first Sunday of the month:

Apart from the option to run on the Nth instance of a given day of the week you can also run the Data Pipeline on a given day number of the month; you can also run every N months, specify up to 10 times to run per day, and set a start and end date.

There are two other scheduling options that aren’t available in the scheduler for semantic models: the ability to run the Data Pipeline every N hours and the ability to run it every N minutes.

Be warned though: refreshing your semantic model every few minutes is dangerous because it can result in excessive CU consumption on your capacity and maybe even throttling if you’re not careful.

The same options to run a Data Pipeline daily and weekly that exist in the scheduler for semantic models are also available, with one notable limitation: the semantic model scheduler allows you to specify up to 48 times to refresh every day for models stored on a Premium/Fabric capacity, whereas the Fabric scheduler used by Pipelines only allows you to specify 10 times per day.

Of course you need a capacity to be able to use Fabric Data Pipelines but orchestration activities only cost 0.0056 CU hours per activity run, so using Pipelines to refresh a semantic model in this way will only use a tiny fraction of even the smallest capacity’s resources.
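
To put that in context: the smallest Fabric capacity, an F2, gives you 2 CUs, which works out at 48 CU hours per day, so a single daily run of a Pipeline containing one activity consumes roughly 0.01% of that day’s capacity in orchestration costs – the refresh operation itself is accounted for separately, as it would be however you trigger it.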

Even if you don’t think you’re interested in anything that Fabric offers beyond Power BI, it’s features like this that, in my opinion, still make it worthwhile to flip the switch to enable Fabric and make your life as a Power BI developer easier.

Refreshing A Power BI Semantic Model With Eventstreams, Pipelines And Activator

Following on from my last post where I showed how to send data from Power Automate to a Fabric Eventstream, in this post I’m going to show how to use it to solve one of my favourite problems: refreshing a Power BI semantic model that uses an Excel workbook stored in OneDrive as a source when that Excel workbook is modified.

Now before I go on I want to be clear that I know this is a ridiculously over-engineered and expensive (in terms of CUs) solution, and that you can do almost the same thing just using Power Automate or in several other different ways – see my colleague Mark Pryce-Maher’s recent videos on using Fabric Open Mirroring with Excel for example. I’m doing this to teach myself Fabric Eventstreams and Activator and see what’s possible with them. Please excuse any mistakes or bad practices.

To start off, I built a Power BI semantic model containing a single table, connected to an Excel workbook stored in OneDrive for Business:

I then built a Fabric Pipeline to refresh the semantic model, using the newly-enhanced Semantic Model Refresh activity:

Next, I created a new Fabric Eventstream and created a slightly more sophisticated version of the Power Automate flow in my previous post that ran every 30 seconds, checked to see if the Excel file was modified, and if it was, sent an event to the Eventstream:

The Compose action in this flow generates some JSON that is sent to the Eventstream and contains the name of the file that has been altered and a dummy value called EventCounter that always contains 1:

I then connected the Eventstream to a new Activator:

In the Activator, I created a new object called ExcelWorkbook. I used the FileName field as the unique identifier and added the EventCounter field as a property:

Finally I created a new rule on the EventCounter property that aggregated the events over 1 minute windows and then fired when the total number of events in each window changed to the value 0 and stayed at 0 for 1 minute:

This is where the real value of Activator’s rules comes in. As I said, it’s easy using other approaches to refresh a semantic model when an Excel file changes. The problem with doing this, however, is that Excel files in OneDrive for Business autosave on a regular basis and if you make several changes to the file over the space of several minutes, you’ll end up trying to refresh the semantic model all the time. This can lead to excessive CU usage on your capacity because you’re refreshing too often as well as situations where one refresh fails because another is already in progress. What the rule above does is only trigger a refresh when changes to the Excel workbook have been detected (so the Total is greater than 0), then there has been a minute where no changes have been detected (so the Total changes to 0) and there has been a further minute where no changes have been detected (so the Total stays at 0). Since Activator rules can trigger Fabric items, I hooked my rule up to the Pipeline shown above to run the refresh:

This screenshot of the Definition tab of the rule shows a pattern of events where there were four one-minute windows where the Excel file was edited and the rule fired three times:

The problem with this rule is that if the Excel file is edited constantly for a long period, so that there is no period of inactivity, the semantic model would not get refreshed until that period of activity is over, which might be too long to wait. I guess the solution would be another rule that detects periods of, say, 10 minutes when the file has been edited constantly and refreshes the model even if there is no period of inactivity.

All in all I think this shows how the combination of Power BI/Power Platform and new features in Fabric can be used to solve old problems in new, interesting ways. I’m looking forward to learning more about Fabric Real-Time Intelligence and Activator!

[Thanks to James Hutton from the Activator team for his help with this post]

The partialBatch Commit Mode In The Power BI Enhanced Refresh API

I have always wondered what the partialBatch option for the commitMode parameter in the Enhanced Refresh API does exactly. There is some documentation here and here but I was curious to find out more as part of the research I’m doing for my ongoing series on Power BI refresh memory errors, in case it was useful for reducing overall memory usage (spoiler: it may be). In this post I’ll share what I found out after running some tests.

For my testing I created a semantic model with four tables called A, B, C and D. Each semantic model table used an M expression similar to the following as its source:

Function.InvokeAfter(
  () => #table(type table [MyCol = text], {{"A"}}),
  #duration(0, 0, 0, 10)
)

This expression – the expression for table A – returns a table with one column and one row after a delay of 10 seconds; the expressions for tables B, C and D were almost identical but had delays of 20, 30 and 40 seconds respectively. These delays ensured that when the tables refreshed they always finished in a set order if they were refreshed in parallel. There were no relationships or measures in the model.

I published this model and then created a notebook to refresh it using Semantic Link’s refresh_dataset method (hat tip to Phil Seamark, whose code I stole) which uses the Enhanced Refresh API behind the scenes:

import sempy.fabric as fabric
import time

WorkspaceName = "CW partialBatch Tests"
SemanticModelName = "partialBatchTest"

# run the refresh using the partialBatch commit mode
request_status_id = fabric.refresh_dataset(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full", commit_mode="partialBatch", retry_count=1, max_parallelism=4)
print("Progress:", end="")

# poll until the refresh completes
while True:
    status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
    if status == "Completed":
        break
    print("░", end="")
    time.sleep(2)

print(": refresh complete")

Note that the commit_mode parameter is set to partialBatch, refresh_type is set to full and that, at this point, max_parallelism was set to 4, which is the number of tables in the model and which meant that all four tables refreshed in parallel.
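
For reference, here’s roughly what the equivalent raw call to the Enhanced Refresh API looks like if you’re not using Semantic Link – a sketch using the Python requests library, with placeholder IDs, that assumes you’ve already acquired an access token by some other means:

import requests

# placeholder workspace (group) and semantic model (dataset) IDs
workspace_id = "00000000-0000-0000-0000-000000000000"
dataset_id = "11111111-1111-1111-1111-111111111111"
access_token = "<access token acquired elsewhere>"

# the Enhanced Refresh API endpoint for the semantic model
url = f"https://api.powerbi.com/v1.0/myorg/groups/{workspace_id}/datasets/{dataset_id}/refreshes"

# the same options as the refresh_dataset call above
body = {
    "type": "Full",
    "commitMode": "partialBatch",
    "maxParallelism": 4,
    "retryCount": 1
}

response = requests.post(url, json=body, headers={"Authorization": f"Bearer {access_token}"})

# 202 means the refresh request was accepted and will run asynchronously
print(response.status_code)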

I ran a Profiler trace while running this code to refresh the model and observed the following happening:

  • First, a refresh of type ClearValues (which clears all data from a table) was run in a single transaction for all four tables. See here for more details about each type of refresh.
  • Next a refresh of type DataOnly (which loads the data into a table but nothing more) was run in a single transaction for all four tables.
  • Finally, a refresh of type Calculate (which builds things like calculated tables, calculated columns, hierarchies and relationships) was run in a single transaction for all four tables. After this, all four tables were fully refreshed.

There are two important things to note about this:

  1. By running a refresh of type ClearValues first, to clear all the data out of the model, the memory consumption of the model is reduced to almost nothing. The downside of this is that the model is no longer queryable until the next two refreshes complete.
  2. Running two separate refreshes of type DataOnly and Calculate, rather than a single refresh of type Full, also reduces peak memory usage although it will also probably be slower.

For my second test I changed the definition of table C so that there was a 50% chance it returned an error when refreshed if an M parameter called FailRefresh was set to “Y”:

if FailRefresh = "Y" and Number.Random() > 0.5 then
  Function.InvokeAfter(
    () =>
      error Error.Record("Forced Refresh Failure"), 
    #duration(0, 0, 0, 30)
  )
else
  Function.InvokeAfter(
    () => #table(type table [MyCol = text], {{"C"}}), 
    #duration(0, 0, 0, 30)
  )

I then published the model again, changed my notebook code so that the retry_count parameter of refresh_dataset was set to 2 (see here for more detail about refresh retries), and ran the refresh until I got an instance where table C’s refresh failed the first time but succeeded on the retry. I observed the following:

  • First, a refresh of type ClearValues was run in a single transaction for all four tables as before.
  • Next, a refresh of type DataOnly was run in a single transaction for all four tables. Although the refreshes for tables A and B, which took 10 and 20 seconds, completed successfully the transaction failed because the refresh for table C returned an error after 30 seconds. The refresh for table D, which would have taken 40 seconds, did not complete.
  • Next, because retry_count was set to 2, the refresh of type DataOnly was run again for all four tables. This succeeded. Note that it did not try to rerun the ClearValues refresh again and that tables A and B were refreshed all over again.
  • Finally a refresh of type Calculate was run in a single transaction for all four tables, as before.

For my final test I kept everything the same except for setting the max_parallelism parameter to 1. I then refreshed until I once again found a case where table C’s refresh failed first time but succeeded on the retry. This time I observed the following:

  • First, a refresh of type ClearValues was run in a single transaction for all four tables as before except with a MaxParallelism of 1.
  • Next, a refresh of type DataOnly was run for just table A in its own transaction. It succeeded.
  • Next, a refresh of type DataOnly was run for just table B in its own transaction. It succeeded.
  • Next, a refresh of type DataOnly was run for just table C in its own transaction. It failed.
  • Next, a refresh of type DataOnly was run for just table C in its own transaction. It succeeded, and this represents the start of the retry.
  • Next, a refresh of type DataOnly was run for just table D in its own transaction. It succeeded.
  • Finally a refresh of type Calculate was run in a single transaction for all four tables, as before except with a MaxParallelism of 1.

This is interesting because, in contrast with the previous test, when table C failed and the retry started, tables A and B did not get refreshed a second time.

What can we learn from this?

  • The way the partialBatch commit mode splits up a refresh into three separate refreshes of type ClearValues, DataOnly and Calculate will reduce the overall peak memory consumption during a refresh, which is useful if your refresh is failing because you are using too much memory.
  • It will result in slower refresh performance overall and the model will not be queryable for most of the refresh, however.
  • If one of the tables in your model fails to refresh then reducing the amount of parallelism and setting a retry_count of more than 0 may mean that the tables that did refresh successfully first time around may not need to be refreshed again. Whether this results in a faster overall refresh though depends on whether not needing to refresh tables again counteracts the effect of less parallelism.

Overall I think using the partialBatch commit mode may be useful if you need to reduce memory usage during refresh or have very unreliable data sources. However if you need this type of fine grained control over your refreshes you’re probably better off writing more complex code (for example using a Fabric notebook as I did here) to control exactly what gets refreshed and how, rather than just making a single call to the Enhanced Refresh API using this option.

[Thanks to Jast Lu for answering my questions about this topic]

Extracting Power BI Import Mode Job Graph Data To A Table

In my last post but one I showed how you could create a DGML file that contains detailed information on what happens during a Power BI Import mode refresh – the “job graph” – and visualise that in Visual Studio or VS Code, to help with performance tuning. In my last post, I explained the concepts of blocking and waiting which are key to understanding this data. In this post I’ll share and describe a Dataflow template that extracts the data contained in the DGML file to a table for other types of analysis – which turns out to be quite simple, because DGML is an XML-based format which is easy to work with in Power Query and Dataflows.
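
To show just how little work is involved, here’s a minimal Python sketch – not the Dataflow itself, and using a made-up local file path – that pulls the nodes and links out of a DGML file using nothing more than the standard library and pandas:

import xml.etree.ElementTree as ET
import pandas as pd

# DGML is plain XML in this namespace
ns = {"dgml": "http://schemas.microsoft.com/vs/2009/dgml"}
root = ET.parse("refreshid.dgml").getroot()

# one row per refresh job, with all the attributes recorded on each node
jobs = pd.DataFrame([node.attrib for node in root.findall("dgml:Nodes/dgml:Node", ns)])

# one row per dependency between jobs
links = pd.DataFrame([link.attrib for link in root.findall("dgml:Links/dgml:Link", ns)])

print(jobs.head())
print(links.head())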

You can download the .pqt template file here (for more information on how to use templates see here). To use it, create a new Dataflow Gen2 and import the template file. You’ll see the following queries:

Next, change the values of the three parameters to contain the ID of the refresh whose DGML file you want to load (the notebook in this post that creates the DGML file uses the refresh ID as the file’s name) and the IDs of the workspace and lakehouse where that file is stored. Finally, set output destinations on the RefreshJobs and/or RefreshJobsPivoted queries. The RefreshJobs query returns a table with one row per refresh job:

This gives you all the data in the DGML file that was visible using a DGML viewer but in table form. For each job you get the CreatedAt, RunnableAt, StartedAt and FinishedAt datetimes, the Blocked, Wait and Running durations, the Slot ID, the CPU used and whether the job is on the critical path.

The RefreshJobsPivoted query gives you exactly the same data but instead of giving you one row per job, you get three: one row for the blocked, waiting and running phases of each job, with the start and end times for each phase and the duration. This may be a more convenient format for visualisation and analysis:

The Links query gives you the dependencies between the jobs:

Having loaded all this data into a Lakehouse you can now build a report on it. As I said, I’m sure there’s a way of visualising all of this data in a way that shows all the durations and dependencies, but even a basic table report on the RefreshJobs table like this is really useful because it shows you which jobs were blocked, which ones had to wait, and which ones took a long time to run:

In this example (the Dependencies example from my previous post, but where the refresh has maxParallelism set to 1, so there is blocking as well as waiting) you can see that refreshing table X took 60 seconds, that the refresh for table Y had to wait for 60 seconds before it could start and took 10 seconds, and that refresh for table XY was blocked for 70 seconds before it could start. So, if you wanted to make this refresh run faster, you would want to understand why table X was so slow and also look at increasing the amount of parallelism so Y did not have to wait.

Understanding Blocking And Waiting In Power BI Import Mode Refreshes

Following on from my previous post showing how you can visualise the job graph for a Power BI Import mode semantic model refresh, in this post I will look at how you can interpret what the job graph tells you – specifically, explaining the concepts of blocking and waiting. As always, simple examples are the best way of doing this.

Blocking

Blocking occurs when one job can only start when one or more other jobs have completed because there is a dependency between them. Consider the following semantic model:

X and Y are tables that contain a single numeric column. X takes 1 minute to load while Y takes 10 seconds to load (I forced this delay using a technique I blogged about here). XYUnion is a DAX calculated table that unions all the rows from X and Y with the following definition:

XYUnion = UNION(X,Y)

As you can imagine, the job that refreshes XYUnion can only start once the jobs that refresh both X and Y have finished; XYUnion will be blocked until both X and Y have refreshed. Here’s what the job graph for the refresh of this semantic model looks like:

At the centre is the job “Process Partition XYUnion[XYUnion]” which refreshes the calculated table XYUnion. The arrows going from this job to the jobs “Process Partition X[X]” and “Process Partition Y[Y]”, which refresh tables X and Y, show that this job depends on those two other jobs.

Hovering over the “Process Partition XYUnion[XYUnion]” job shows the following popup:

There are four datetime values here: CreatedAt, RunnableAt, StartedAt and FinishedAt. There are also three durations:

  • Block is the elapsed time between CreatedAt and RunnableAt, and is the amount of time elapsed before all the jobs this job depends on were completed. Anything other than a zero here means that blocking has occurred.
  • Wait is the elapsed time between RunnableAt and StartedAt. A job becomes runnable when all the jobs it depends on have completed, but even then it may not be able to start because Power BI only allows a certain number of jobs to refresh in parallel (see this post for more details and how to control the amount of parallelism). Waiting is described in the example below.
  • Run is the elapsed time between StartedAt and FinishedAt and is the amount of time the job itself took to run.

In this case you can see that the value for Block for XYUnion is 1 minute: X and Y have no preceding jobs so they kick off at the same time, X takes 1 minute to run and Y takes 10 seconds, so it is 1 minute before XYUnion can run. The popups for X and Y show 1 minute and 10 seconds respectively for Run, as you would expect:

One last thing to mention: in the full job graph diagram above you’ll see that certain nodes are highlighted in red. That’s because they are on the critical path, which is documented here; it’s the chain of jobs that dictates the overall length of the refresh, so if you want to make your refresh faster then you need to tune the jobs on the critical path. In this case the critical path goes through X to XYUnion: if you wanted the whole semantic model to refresh faster you would need to tune the refresh for either X or XYUnion; tuning the refresh for Y would make no difference to the overall semantic model refresh time.

Waiting

As I already mentioned, there is a limit on the number of jobs within a refresh that can run in parallel. The maximum number of jobs is represented by the number of “slots” – one slot can only run one job at a time – and in the screenshots of the popups above you can see each job has a slot number. If there are more jobs that could be run than there are slots available at a given time then some jobs have to wait for a slot.

Here’s another example: a semantic model with three tables, A, B and C, which each take 30 seconds to refresh.

There are no dependencies between the tables so in theory each of these three tables could refresh in parallel. However if you refresh with maxParallelism set to two then only two of the three can refresh at any one time. Here’s the job graph for a refresh that does that:

As you can see the critical path goes through the refresh jobs for table A, and hovering over the “Process Partition A[A]” job shows the following:

While this job was not blocked at all because there are no jobs it depends on, it had to wait 30 seconds for a slot to become available; it eventually ran in slot 0. Hovering over the nodes for “Process Partition B[B]” and “Process Partition C[C]” shows that they ran in slot 0 and slot 1 respectively and that neither of them was blocked or had to wait.

The job graph isn’t always the best way of visualising this type of parallelism; Phil Seamark’s original Power BI report for visualising refreshes has a page which shows the slots and the jobs in them but I think there’s probably a better way of visualising all of this data that shows slots as well as dependencies. Maybe a Deneb visual is the answer? If anyone has ideas I’d be interested to hear them! In any case, the first step to doing this is to extract all of this data from the .DGML file into a table and that’s what I’ll demonstrate how to do in the next post in this series.

Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.
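
To give you an idea of the format, a heavily simplified DGML file looks something like this – in reality each Node carries many more attributes holding the timings for each job:

<DirectedGraph xmlns="http://schemas.microsoft.com/vs/2009/dgml">
  <Nodes>
    <Node Id="0" Label="Process Partition X[X]" />
    <Node Id="1" Label="Process Partition XYUnion[XYUnion]" />
  </Nodes>
  <Links>
    <Link Source="1" Target="0" />
  </Links>
</DirectedGraph>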

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns

event_schema = {
"JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break

            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop Trace and collect logs
        final_trace_logs = trace.stop()

# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the EventText column of the remaining events and concatenate it to get the contents of the DGML file and then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
dgmlfile = open("/lakehouse/default/Files/" + request_status_id + ".dgml", 'x')
print (out, file=dgmlfile)
dgmlfile.close()

#dispose of trace connection
trace_connection.disconnect_and_dispose()

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, OneLake Explorer makes it easy to sync files in OneLake with your PC in a very similar way to OneDrive, which makes working with these DGML files in VS Code very straightforward because you can simply open the local copy when it syncs.

Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]

Incremental Refresh On Delta Tables In Power BI

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However not everyone is ready for Fabric yet so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lakehouse and your import mode Power BI semantic models then this new function could allow you to remove it, to reduce complexity and cut costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible using this connector because there’s no database to query: a Delta table is just a folder with some files in. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table by the datetime column required for incremental refresh, that query is significantly faster than reading all the data from that table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.

There are three factors that influence the performance of Power Query when querying a Delta table:

  1. The internal structure of the Delta table, in particular whether it is partitioned or not
  2. The implementation of the connector, ie the DeltaLake.Table function
  3. The M code you write in the queries used to populate the tables in your semantic model

There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.

Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"

Here’s the incremental refresh dialog:

I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:

The refresh took just over 30 minutes.

Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column added that contains just the date part of the pickup datetime, and I then partitioned the table by that new datetime column (called PickupDate here):

CREATE TABLE PartitionedByDateNYC
USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw

I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
    )
in
    #"Filtered Rows"

…and refreshed again. This time the refresh took about 26 seconds.

Half an hour to 26 seconds is a big improvement and it’s because the DeltaLake.Table function is able to perform partition elimination: the partitions in the semantic model align to one or more partitions in the Delta table, so when each partition in the semantic model is refreshed Power Query only needs to read data from the partitions in the Delta table that contain the relevant data. This only happens because the filter in the Power Query query using the RangeStart and RangeEnd parameters is on the same column that is used to partition the Delta table.

In my final test I partitioned my Delta table by month, like so:

CREATE TABLE PartitionedNYC
USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw

The challenge here is that:

  1. The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
  2. Power BI incremental refresh creates partitions at the year, quarter, month and date granularities, so filtering by month can’t be used for date partitions

I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart),
    YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd),
    FilterByPartition = Table.StopFolding(
        Table.SelectRows(
            ToDelta,
            each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd
        )
    ),
    #"Filtered Rows" = Table.SelectRows(
        FilterByPartition,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"

Interestingly this table refreshed even faster: it took only 18 seconds.

This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).

To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.