Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.
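As an aside, once you've generated a DGML file (as I'll show below) it's just XML, so as well as visualising it you can pull the job and dependency data out of it programmatically. Here's a minimal Python sketch of how you might do that with the standard library – the file path is a placeholder and I'm assuming the standard DGML element and attribute names:

import xml.etree.ElementTree as ET

# the namespace used by DGML files
ns = {"dgml": "http://schemas.microsoft.com/vs/2009/dgml"}

# placeholder path - point this at a DGML file you've generated
tree = ET.parse("refresh.dgml")
root = tree.getroot()

# list every job (node) in the graph
for node in root.findall("./dgml:Nodes/dgml:Node", ns):
    print("Job:", node.get("Id"), "-", node.get("Label"))

# list every dependency (link) between jobs
for link in root.findall("./dgml:Links/dgml:Link", ns):
    print("Dependency:", link.get("Source"), "->", link.get("Target"))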

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns

event_schema = {
    "JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model and stop the trace. Next, filter the events captured so that you only keep those with the EventSubclass GraphFinished, remove the event that contains the metadata (which has a value of 0 in the IntegerData column) and finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break

            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop trace and collect logs
        final_trace_logs = trace.stop()



# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the TextData column of the remaining events and concatenate it to get the contents of the DGML file, then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
with open("/lakehouse/default/Files/" + request_status_id + ".dgml", "x") as dgmlfile:
    print(out, file=dgmlfile)

#dispose of trace connection
trace_connection.disconnect_and_dispose()

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than downloading the file manually, you can use OneLake Explorer to sync files in OneLake to your PC in much the same way as OneDrive, which makes working with these DGML files in VS Code very straightforward: you can simply open the local copy once it has synced.

Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that I show how you can extract the data in this DGML file to a table using a Dataflow.]

Exploring Power BI Run-Length Encoding With DMVs

Recently I was involved in an interesting discussion on X (formerly Twitter) about how partitioning a table in a Power BI semantic model can affect compression and therefore its size and query performance. You can read the thread here. It got me thinking: is there a way to get more detail on how well compression, and in particular run-length encoding (RLE from hereon), is working for a column when you’re using Import mode or Direct Lake? After a bit of research I found out there is, so let’s see some examples that illustrate what I learned.

First of all, consider the following M query that returns a table with one numeric column called MyNumbers:

let
    Source = {1..DistinctValues},
    Repeat = List.Transform(Source, each List.Repeat({_}, TotalRows/DistinctValues)),
    Combine = List.Combine(Repeat),
    #"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type"

It references two M parameters: TotalRows, which determines the total number of rows in the table, and DistinctValues, which determines the number of distinct numeric values in the MyNumbers column. With TotalRows=9 and DistinctValues=3, it returns the following output:

Note that in this case it returns 3 rows with the value 1 repeated, 3 rows with the value 2 repeated and 3 rows with the value 3 repeated; three sets of repeated values in all. It’s fair to assume that repeated sets of values like this are a good candidate for RLE.
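To see why, here’s a toy Python illustration of the idea behind RLE – this is only meant to show the concept, not how the VertiPaq engine actually stores data internally. Instead of storing every value, you store each distinct value once along with the length of its run:

from itertools import groupby

values = [1, 1, 1, 2, 2, 2, 3, 3, 3]

# run-length encode: store (value, run length) pairs instead of every value
runs = [(value, len(list(group))) for value, group in groupby(values)]
print(runs)  # [(1, 3), (2, 3), (3, 3)] - 3 runs instead of 9 stored values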

I created a semantic model in Import mode containing only this table and published it to the Power BI Service. Initially TotalRows was set to 1,000,000 and DistinctValues was set to 100 – so the table consisted of just 100 sets of 10,000 repeated values. I chose 1,000,000 rows because that’s the size of a segment in the Power BI Service with the “small semantic model format” setting, and compression always takes place within a segment.

When the Analysis Services engine inside Power BI compresses data it looks for sequences of repeated values to see if RLE can be used. Where it finds them, those sequences are stored as “pure” RLE runs; where it doesn’t, the values end up in “impure” RLE runs and are stored using bit-packing compression instead. Pure runs are generally a good thing, impure runs generally a bad thing. You can see how many pure and impure runs there are using the TMSCHEMA_COLUMN_STORAGES DMV, for example with the following DMV query:

select 
[Name], Statistics_DistinctStates, Statistics_RowCount, 
Statistics_RLERuns, Statistics_OthersRLERuns 
from $SYSTEM.TMSCHEMA_COLUMN_STORAGES
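As an aside, you don’t have to use an external tool to run DMV queries like this: Semantic Link’s evaluate_dax function can pass a query like this through to a published semantic model from a Fabric notebook too. Here’s a rough sketch, using placeholder workspace and semantic model names – treat the ability to run DMV queries this way as something to verify for yourself:

import sempy.fabric as fabric

dmv_query = """
select 
[Name], Statistics_DistinctStates, Statistics_RowCount, 
Statistics_RLERuns, Statistics_OthersRLERuns 
from $SYSTEM.TMSCHEMA_COLUMN_STORAGES
"""

# send the DMV query to the model's XMLA endpoint and return the result as a DataFrame
df = fabric.evaluate_dax(
    dataset="Insert semantic model name here",
    dax_string=dmv_query,
    workspace="Insert workspace name here"
)
display(df)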

Running this query in DAX Studio on my published semantic model returned the following table:

[You can ignore all the rows except the one for the MyNumbers column in this table]

The Statistics_RLERuns column shows the number of pure RLE runs; the Statistics_OthersRLERuns column shows the number of impure RLE runs. In this case you can see, for the MyNumbers column, there were 100 pure RLE runs and no impure runs, so as expected RLE is working well.

Here’s what Vertipaq Analyzer showed for this table:

Unsurprisingly the size of the MyNumbers column is very small.

Then I changed DistinctValues to 100,000 (keeping TotalRows at 1,000,000), giving me 100,000 sets of 10 values, and refreshed the dataset. Here’s what the DMV query on TMSCHEMA_COLUMN_STORAGES returned:

And here’s what Vertipaq Analyzer showed:

As you can see, the column was a lot larger than before; there were no pure RLE runs and one impure RLE run. In this case the large number of distinct values in the column prevented RLE from taking place and this had a negative impact on the size of the column.

These are two extreme cases. What about a scenario that’s somewhere in between? I modified my M query as follows:

let
    RepeatedNumbers =
        let
            Source = {1..DistinctValues},
            Repeat = List.Transform(Source, each List.Repeat({_}, ((TotalRows/2)/DistinctValues))),
            Combine = List.Combine(Repeat),
            #"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
            #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
        in
            #"Changed Type",

    RandomNumbers =
        let
            Source = {1..TotalRows/2},
            #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
            #"Added Custom" = Table.AddColumn(#"Converted to Table", "MyNumbers", each Number.Round(Number.RandomBetween(TotalRows+1, TotalRows*2))),
            #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"MyNumbers"}),
            #"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"MyNumbers", Int64.Type}})
        in
            #"Changed Type",

    Output = Table.Combine({RepeatedNumbers, RandomNumbers})
in
    Output

What this version of the code does is return a table where the first 50% of the rows are repeated numbers and the second 50% are random numbers. With TotalRows set to 12 and DistinctValues set to 2 it produces the following output:

With this version published to the Power BI Service I set TotalRows to 1,000,000 again and set DistinctValues to 2000, resulting in a table with 2000 sets of 250 repeating values followed by 500,000 random values. Here’s what the DMV query against TMSCHEMA_COLUMN_STORAGES returned:

As you can see there are now 2000 pure runs (I assume for the first 50% of rows with repeated values) and 1 impure run (I assume for the second 50% of rows with random values).

Here’s the output of Vertipaq Analyzer:

The column is now almost as large as in the second scenario above.

You can get a bit more detail about what’s happening in the impure runs with the DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS DMV. Running the following query against the latest version of the table:

select 
column_ID, partition_name, segment_number,
records_count, bits_count, used_size
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS

…returns the following values:

To get a rough idea of the number of rows in the impure runs you can use the following formula:

(used_size * 8)/bits_count

In this case, for the MyNumbers column, (1349720 * 8)/21 = 514,179, which makes sense since my code returns 500,000 rows of random numbers. The records_count column in this query returns the total number of rows in the segment, so the higher the number of rows in impure runs relative to the total, the worse the compression you’re getting.
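If you want to automate that calculation, here’s a rough sketch that runs the same segments DMV through Semantic Link and applies the formula to every segment. Again, treat running DMV queries through evaluate_dax as an assumption to verify, and note that the estimate is only meaningful for segments that actually contain impure runs:

import sempy.fabric as fabric

segments_query = """
select 
column_ID, partition_name, segment_number,
records_count, bits_count, used_size
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS
"""

df = fabric.evaluate_dax(
    dataset="Insert semantic model name here",
    dax_string=segments_query,
    workspace="Insert workspace name here"
)

# normalise column name casing, since DMVs can return upper-case column names
df.columns = [c.lower() for c in df.columns]

# rough estimate of the number of rows stored in impure (bitpacked) runs per segment
df["estimated_impure_rows"] = (df["used_size"] * 8) / df["bits_count"]
display(df)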

What practical use is this information? Probably not as much as you might think, interesting as it is. It can tell you how well RLE is working for a column but it doesn’t tell you much about how to optimise it, or whether it is possible to optimise it, or whether optimising it is a good idea – that’s a subject for another blog post.

[Thanks to Marius Dumitru and Akshai Mirchandani for the information in this post]