Reading Parquet Metadata In Power Query In Power BI

There’s a new M function in Power Query in Power BI that allows you to read the metadata from a Parquet file: Parquet.Metadata. It’s not documented yet and it’s currently marked as “intended for internal use only”, but I’ve been told I can blog about it. Here’s an example of how to use it:

let
    Source = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet"))
in
    Source

…and here’s an example of the output:

This query shows how to expand the record returned by this function into a table:

let
    m = Parquet.Metadata(File.Contents("C:\myfile.snappy.parquet")),
    schema = List.Accumulate(
        Table.ToRecords(m[Schema]),
        [],
        (x, y) => if y[NumChildren] = null then Record.AddField(x, y[Name], y[LogicalType] ?? y[ConvertedType]) else x
    ),
    expanded1 = Table.ExpandTableColumn(m[RowGroups], "Columns", {"MetaData"}),
    renamed1 = Table.RenameColumns(expanded1, {{"Ordinal", "RowGroup"}, {"TotalCompressedSize", "RowGroupCompressedSize"}, {"TotalByteSize", "RowGroupSize"}}),
    expanded2 = Table.ExpandRecordColumn(renamed1, "MetaData", {"Type", "Encodings", "PathInSchema", "Codec", "NumValues", "TotalUncompressedSize", "TotalCompressedSize", "KeyValueMetadata", "DataPageOffset", "IndexPageOffset", "DictionaryPageOffset", "Statistics", "EncodingStats"}),
    renamed2 = Table.RenameColumns(expanded2, {{"Type", "PhysicalType"}}),
    added1 = Table.AddColumn(renamed2, "Column", each Text.Combine([PathInSchema])),
    added2 = Table.AddColumn(added1, "Cardinality", each [Statistics][DistinctCount]),
    added3 = Table.AddColumn(added2, "NullCount", each [Statistics][NullCount]),
    added4 = Table.AddColumn(added3, "DictionarySize", each [DataPageOffset] - [DictionaryPageOffset]),
    added5 = Table.AddColumn(added4, "LogicalType", each Record.FieldOrDefault(schema, [Column], null)),
    selected = Table.SelectColumns(added5, {"RowGroup", "Column", "Codec", "NumValues", "Cardinality", "NullCount", "TotalCompressedSize", "TotalUncompressedSize", "DictionarySize", "PhysicalType", "LogicalType"})
in
    selected

As you can see this gives you all kinds of useful information about a Parquet file such as the schema, the compression type used, column cardinality and so on.

[Thanks to Curt Hagenlocher for the tip-off and the query above]
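
As a quick example of the kind of thing you can do with this output, here’s a snippet that takes the table returned by the query above – I’m assuming here that it has been saved as a query called ParquetMetadata, so adjust the name to match your own query – and adds a compression ratio column for each column in each row group:

let
    //the output of the query above, assumed to be saved as a query called ParquetMetadata
    Source = ParquetMetadata,
    //add a compression ratio for each column in each row group
    AddRatio = Table.AddColumn(
        Source,
        "CompressionRatio",
        each try [TotalUncompressedSize] / [TotalCompressedSize] otherwise null,
        type number
    ),
    //sort so the columns that compress least well come first
    Sorted = Table.Sort(AddRatio, {{"CompressionRatio", Order.Ascending}})
in
    Sorted

A ratio close to 1 means a column isn’t compressing well, which can be a useful prompt to look more closely at how that column is encoded or sorted.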

Performance Implications Of Using Calculated Columns In Composite Models On Power BI Semantic Models

I don’t have anything against the use of calculated columns in Power BI semantic models in general but you do need to be careful using them with DirectQuery mode. In particular when you have a DirectQuery connection to another Power BI semantic model – also known as a composite model on a Power BI semantic model – it’s very easy to cause serious performance problems with calculated columns. Let’s see a simple example of why this is.

Let’s say you have an Import mode semantic model called Source Model containing sales data:

Here’s the contents of the Sales table:

And here’s the definition of the Sales Amount measure:

Sales Amount = SUM(Sales[Sales])

Now, let’s say this semantic model gets published to the Service and you open Power BI Desktop, create a Live connection to this model and then click the “Make changes to this model” button to create a composite model. You now have a local composite model in Power BI Desktop that is connected to the Source Model semantic model.

Next, you decide to create a calculated column on the Sales table and, after reading Marco and Alberto’s article on circular dependencies (because you’ll definitely run into circular dependency errors), come up with the following code:

Tax =
CALCULATE (
    [Sales Amount] * 0.1,
    ALLEXCEPT ( Sales, Sales[TransactionNumber] )
)

Great! You can now build a report that looks like this:

This is where the first warning comes in, though. In Import mode semantic models calculated columns are evaluated when the model is refreshed, but in DirectQuery mode they are evaluated at query time if the query needs to use them. Performance Analyzer shows two DAX queries generated for the visual above. The first is the query that the visual runs against the local semantic model in Power BI Desktop, which looks like this:

DEFINE
VAR __DS0Core =
SUMMARIZECOLUMNS(
ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
"Sales_Amount", 'Sales'[Sales Amount],
"SumTax", CALCULATE(SUM('Sales'[Tax]))
)

VAR __DS0PrimaryWindowed =
TOPN(502, __DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
__DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s nothing much interesting here. However, there’s a second DAX query generated: the one generated by the local model to get the data it needs from the Source Model semantic model in the Service. This query looks like this:

Define
COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

var ASDQ___DS0Core = SUMMARIZECOLUMNS( ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
"Sales_Amount", [Sales Amount],
"SumTax", CALCULATE(SUM('Sales'[ASDQ_Tax]))
)
var ASDQ___DS0PrimaryWindowed = TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
ASDQ___DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

Notice, right at the top of the query, a DEFINE statement that defines the Tax calculated column. Every time this visual is rendered in your Power BI report the calculated column will be re-evaluated and that will have a performance overhead. If the calculated column had been created in Source Model (which is Import mode remember) it would have been evaluated when the model was refreshed and any queries that used it would probably be faster as a result.

There’s something else to watch out for though. Let’s say you define a second calculated column in the local model like so:

Apple Sales =
CALCULATE (
    [Sales Amount],
    FILTER ( ALLNOBLANKROW ( 'Sales'[Product] ), 'Sales'[Product] = "Apples" )
)

You can now create a table that looks like this, which does include the Apple Sales column but does not include the Tax column:

The DAX query sent by the visual to the local model is not worth looking at. However, the DAX query sent by the local model to Source Model does have something surprising in it:

Define
COLUMN 'Sales'[ASDQ_Apple Sales] = CALCULATE([Sales Amount], FILTER(ALLNOBLANKROW('Sales'[Product]), 'Sales'[Product]="Apples"))

COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

var ASDQ___DS0Core = SUMMARIZECOLUMNS( ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
"Sales_Amount", [Sales Amount],
"SumApple_Sales", CALCULATE(SUM('Sales'[ASDQ_Apple Sales]))
)

var ASDQ___DS0PrimaryWindowed = TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
ASDQ___DS0PrimaryWindowed
ORDER BY
[IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s a DEFINE statement for the Apple Sales column as you would expect. However there is also a DEFINE statement for the Tax column, which means it is evaluated too even though it’s not shown in the visual and it seems as though the Apple Sales column doesn’t reference it – or does it? I’m not going to try to explain what’s going on here (again, this is best left to Marco and Alberto) but the important point is that both calculated columns are now being evaluated at query time which means the query will be even slower. If you have large data volumes and/or your calculated columns contain some complex DAX you could end up with some very slow reports. If you’re using Premium then you could also be putting a lot of extra load on your capacity.

Composite models are an important part of Power BI’s self-service story and it’s inevitable that at some point your developers will want to build calculated columns in them. As always you need to monitor the performance of your reports carefully, both in terms of duration and CPU usage, while you are building them to make sure they are well optimised, and keep the use of calculated columns in composite models to a minimum.

Learning Power BI And Fabric By Attending User Groups And Conferences

One of the most popular questions on the forums I follow is “How can I learn Power BI and Fabric?”. There are a ton of great, free online resources out there – blogs, videos and so on – but the people answering this question often do not mention user groups and conferences. I think that’s a mistake because, at least for me, attending (and indeed speaking at) online and in-person events is one of the most effective ways of learning.

Why is this? It’s not about the content, I think: a lot of what is presented at these meetings is freely available elsewhere already. Indeed, a lot of user groups and conferences record their sessions now and make them available for free online afterwards (eg SQLBits). What’s different is that an event, especially an in-person event, demands your attention more because it takes place at a specific place and time. You can’t pause an event because you’ve been distracted by an important work email or because you’ve seen a cat video that you just have to share, which means you’re more likely to concentrate on it than on a blog post or a video. That’s even more true for an in-person event where, if you’ve made the effort to travel to a venue and you’re sitting in an audience, there’s added social pressure to pay attention to what the speaker is saying. And trust me, I need all the help I can get avoiding distractions these days.

The other benefit of attending an event is the ability to chat with, and ask questions of, the presenter, other attendees and sponsors. Every online event I’ve been to has had a very active chat, and of course with in-person events there’s the ability to queue up and ask the speaker questions or ask for a selfie. I know a lot of people who have found new jobs as a result of attending conferences and user groups. In-person events very often have social activities which make it easy to meet people as well; again, SQLBits is a great example of this with its legendary party and other activities such as runs and the pub quiz.

User groups are almost always free and in many cases there is free pizza too. Most conferences are paid although some have a free day (in the UK SQLBits is free on Saturdays and Data Relay is an annual series of free one-day events). Even when you have to pay I think they represent very good value for money compared to other things you could spend your training budget on.

If I’ve convinced you to attend an event, the next question is: how do I find one near me? Meetup is the best way to find user groups and searching for “Power BI” or “Fabric” and your location there will give you a list of upcoming events you could attend. In the UK there are very active in-person user groups in London, Manchester (I was there last week and they had over 100 attendees), Birmingham, Leeds and Newcastle. There are also several UK groups that are wholly or partly online, such as Oxford, Devon & Cornwall and the Fabric UK User Group.

I don’t think there’s a single list of all the Power BI or Fabric conferences anywhere but here are a few coming up soon that I plan to attend or know are good:

  • The Microsoft Fabric Community Conference in Las Vegas, USA, March 26-28 2024. There will be a lot of Microsoft folks presenting and a lot of big announcements there!
  • Fabric February in Oslo, Norway, February 7-9 2024 – although I think it’s just sold out.
  • Power BI Cruise from Stockholm to Helsinki, June 3-5 2024. I really need to do this one year.
  • Power BI Next Step in Aarhus, Denmark, September 12-13 2024. Those Scandinavians do love the Microsoft data platform.
  • The Global Excel Summit in London, UK, February 7-9. Despite the name there’s a lot of Power BI content here too.
  • SQLDay in Wroclaw, Poland, 13-15 May 2024.
  • DATA:Scotland in Glasgow, UK, 13 September 2024.
  • SQLKonferenz in Hanau, Germany, 30 September-2 October.
  • SQLBits in Farnborough, UK, 19-23 March 2024. The best Microsoft data-related conference in the world as far as I’m concerned – I never miss it – and by far the largest in Europe too.

Apologies if I’ve missed your favourite event, there are too many out there to list or even remember…

So what are you waiting for? If you’re serious about learning Power BI and Fabric get yourself along to one of these events!

Understanding The “Evaluation resulted in a stack overflow” Error In Power Query In Excel And Power BI

If you’re writing your own M code in Power Query in Power BI or Excel you may run into the following error:

Expression.Error: Evaluation resulted in a stack overflow and cannot continue.

If you’re a programmer you probably know what a stack overflow is; if you’re not you might search for the term, find this Wikipedia article and still have no clue what has happened. Either way it may still be difficult to understand why you’re running into it in Power Query. To explain let’s see some examples.

First, a really simple one. You can write recursive functions – functions that call themselves – in Power Query, although as you might suspect that is generally a bad idea because they are difficult to write, difficult to understand, slow and may result in the error above. Consider the following function called MyRecursiveFunction which references a parameter of type number called MaxDepth:

(counter as number) as number =>
    if counter > MaxDepth then counter else @MyRecursiveFunction(counter + 1)

The function takes a number and calls itself, passing in a value one greater than the number that was passed to it, until the number passed is greater than MaxDepth. So, if the value of MaxDepth is 3 and you call the function and pass it the value of 1, like so:

MyRecursiveFunction(1)

…then you’ll get the value 4 back:

So far so good. But how long can a function in M go on calling itself? As you can imagine, it’s not forever. So, when you hit the point where the function can’t go on calling itself then you get the error above. For example if you try setting MaxDepth to 100000 then you’ll get the stack overflow error above instead of 100001:

As a result it’s almost always a good idea to avoid recursion in Power Query and use functions like List.Transform, List.Generate or List.Accumulate to achieve the same result. A great example of this is shown in the Power Query custom connector documentation in the section on handling APIs that return results broken up into pages with the Table.GenerateByPage code sample.
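
For example, the MyRecursiveFunction function above can be rewritten without recursion using List.Generate. Here’s a rough sketch that returns the same values (it assumes the same MaxDepth parameter exists); because List.Generate loops rather than making nested function calls, it shouldn’t run into the stack overflow error however large MaxDepth is:

(counter as number) as number =>
let
    //generate the sequence counter, counter+1, ... for as long as the value is less than or equal to MaxDepth
    Steps = List.Generate(() => counter, each _ <= MaxDepth, each _ + 1),
    //the recursive version returns the first value greater than MaxDepth, which is either
    //the original value (if it was already greater) or the last generated value plus one
    Result = if List.IsEmpty(Steps) then counter else List.Last(Steps) + 1
in
    Result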

You may still get this error even when you’re not explicitly using recursion though, as a result of lazy evaluation. Consider the following query which uses List.Accumulate to generate a table with a given number of rows:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 3,
    //Use List.Accumulate to create a table with this number of rows
    //By calling Table.Combine
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
    )
in
    CombineAllTables

Here’s the output, a table with three rows:

But how does it get this result? With NumberOfTimes=3 you can think of this query lazily building up an M expression something like this:

Table.Combine({Table.Combine({MyTable, MyTable}), MyTable})

…which, once List.Accumulate has finished, suddenly all has to be evaluated and turned into a single table. Imagine how much nesting of Table.Combine there would be if NumberOfTimes was a much larger number though! And indeed, it turns out that you can’t nest lots and lots of calls to Table.Combine without running into a stack overflow. So if NumberOfTimes=100000 like so:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 100000,
    //Use List.Accumulate to create a table with this number of rows
    //by calling Table.Combine
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
    )
in
    CombineAllTables

…then, after a minute or so, you get the “Evaluation resulted in a stack overflow and cannot continue” error again.

Rewriting the query so you build up the list of tables first and only call Table.Combine once at the end avoids the problem and is much faster:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 100000,
    //create a table with NumberOfTimes rows
    CombineAllTables = Table.Combine(List.Repeat({MyTable}, NumberOfTimes))
in
    CombineAllTables

It’s also possible to solve the problem by forcing eager evaluation inside List.Accumulate but this is extremely tricky: there’s an example of this on Gil Raviv’s blog here.

New Limits For The “Maximum Connections Per Data Source” Property In Power BI DirectQuery Mode

One of the most important properties you can set in a Power BI DirectQuery semantic model is the “Maximum connections per data source” property, which controls the number of connections that can be used to run queries against a data source. The good news is that the maximum value that you can set this property to has just been increased in Premium.

This property is important because the number of open connections to a data source acts as a limit on the number of concurrent queries that can be run by the semantic model against the source: each connection can only have one query running on it at any one time. If you have Power BI reports that have a large number of visuals on a page and/or a large number of users running reports at the same time then it is very easy for a DirectQuery semantic model to need to send lots of queries back to your data source at the same time. If some of the queries that your semantic model runs against your data source are slow – more than one or two seconds even – then the number of queries that need to be run at a given time will increase. The same is true if you have increased the Max Parallelism Per Query property to increase the number of parallel queries that can be generated by a single DAX query.

This property is documented in a number of places, including the DirectQuery guidance documentation and in data source-specific best practice documents such as this one for Snowflake. You can set the property in Power BI Desktop in the Current File/DirectQuery section of the Options dialog:

If you are not using Power BI Premium (ie you are using Power BI Shared capacity, also known as Power BI Pro) then the maximum value that you can set this property to is 10. If you are using Power BI Premium then the maximum value up to today was 30 but now that limit has been increased. The table on this page shows what the new limits per SKU are:

As you can see, for a P1/F64 the maximum limit is now 50 rather than 30 and this limit goes all the way up to 200 for a P4/F512 and higher.

I’ve seen plenty of cases where increasing the value of this property makes Power BI reports run a lot faster. However, this will only be true if your data source is able to handle the number of queries that Power BI is trying to run against it. As I showed in this post, if your data source can’t handle the number of queries you’re trying to run then performance will get worse and not better, so you should try different values to see which one works best.

Extracting Power BI Import Mode Job Graph Data To A Table

In my last post but one I showed how you could create a DGML file that contains detailed information on what happens during a Power BI Import mode refresh – the “job graph” – and visualise that in Visual Studio or VS Code, to help with performance tuning. In my last post, I explained the concepts of blocking and waiting which are key to understanding this data. In this post I’ll share and describe a Dataflow template that extracts the data contained in the DGML file to a table for other types of analysis – which turns out to be quite simple, because DGML is an XML-based format which is easy to work with in Power Query and Dataflows.
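
Before getting to the template itself, here’s a rough sketch of the general approach so you can see how little work is involved – it reads the Node elements from a DGML file using Xml.Document (the file path here is just a placeholder and the navigation steps may need tweaking for your own file):

let
    //placeholder path: a DGML file synced to your PC, for example via OneLake Explorer
    Source = Xml.Document(File.Contents("C:\Temp\myrefreshid.dgml")),
    //the root element is <DirectedGraph>; its Value column contains its child elements
    DirectedGraph = Source{0}[Value],
    //the <Nodes> element contains one <Node> element per refresh job
    Nodes = DirectedGraph{[Name = "Nodes"]}[Value],
    //each Node stores its details as XML attributes, so turn each set of attributes into a record
    Records = List.Transform(Nodes[Attributes], each Record.FromList([Value], [Name])),
    //not every node necessarily has the same attributes, so build the column list from all of them
    ColumnNames = List.Distinct(List.Combine(List.Transform(Records, Record.FieldNames))),
    Jobs = Table.FromRecords(Records, ColumnNames, MissingField.UseNull)
in
    Jobs

The Link elements, which describe the dependencies between jobs, can be read in exactly the same way.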

You can download the .pqt template file here (for more information on how to use templates see here). To use it, create a new Dataflow Gen2 and import the template file. You’ll see the following queries:

Next, change the values of the three parameters to contain the IDs of the refresh whose DGML file you want to load (the notebook in this post that creates the DGML file uses the refresh ID as the file’s name) and the workspace and lakehouse where that file is stored. Finally, set output destinations on the RefreshJobs and/or RefreshJobsPivoted queries. The RefreshJobs query returns a table with one row per refresh job:

This gives you all the data in the DGML file that was visible using a DGML viewer but in table form. For each job you get the CreatedAt, RunnableAt, StartedAt and FinishedAt datetimes, the Blocked, Wait and Running durations, the Slot ID, the CPU used and whether the job is on the critical path.
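
Incidentally, if you ever need to recreate the duration calculations yourself they can be derived from those four datetimes. Here’s a minimal sketch that assumes a query called Jobs returning one row per job, with the CreatedAt, RunnableAt, StartedAt and FinishedAt columns already converted to datetime values:

let
    //Blocked: time between the job being created and all the jobs it depends on finishing
    AddBlocked = Table.AddColumn(Jobs, "Blocked", each [RunnableAt] - [CreatedAt], type duration),
    //Wait: time spent waiting for a free slot once the job was runnable
    AddWait = Table.AddColumn(AddBlocked, "Wait", each [StartedAt] - [RunnableAt], type duration),
    //Running: time the job itself took to run
    AddRunning = Table.AddColumn(AddWait, "Running", each [FinishedAt] - [StartedAt], type duration)
in
    AddRunning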

The RefreshJobsPivoted query gives you exactly the same data but instead of giving you one row per job, you get three: one row for the blocked, waiting and running phases of each job, with the start and end times for each phase and the duration. This may be a more convenient format for visualisation and analysis:

The Links query gives you the dependencies between the jobs:

Having loaded all this data into a Lakehouse you can now build a report on it. As I said, I’m sure there’s a way of visualising all of this data that shows all the durations and dependencies, but even a basic table report on the RefreshJobs table like this is really useful because it shows you which jobs were blocked, which ones had to wait, and which ones took a long time to run:

In this example (the Dependencies example from my previous post, but where the refresh has maxParallelism set to 1, so there is blocking as well as waiting) you can see that refreshing table X took 60 seconds, that the refresh for table Y had to wait for 60 seconds before it could start and took 10 seconds, and that refresh for table XY was blocked for 70 seconds before it could start. So, if you wanted to make this refresh run faster, you would want to understand why table X was so slow and also look at increasing the amount of parallelism so Y did not have to wait.

Understanding Blocking And Waiting In Power BI Import Mode Refreshes

Following on from my previous post showing how you can visualise the job graph for a Power BI Import mode semantic model refresh, in this post I will look at how you can interpret what the job graph tells you – specifically, explaining the concepts of blocking and waiting. As always, simple examples are the best way of doing this.

Blocking

Blocking occurs when one job can only start when one or more other jobs have completed because there is a dependency between them. Consider the following semantic model:

X and Y are tables that contain a single numeric column. X takes 1 minute to load while Y takes 10 seconds to load (I forced this delay using a technique I blogged about here). XYUnion is a DAX calculated table that unions all the rows from X and Y with the following definition:

XYUnion = UNION(X,Y)

As you can imagine, the job that refreshes XYUnion can only start once the jobs that refresh both X and Y have finished; XYUnion will be blocked until both X and Y have refreshed. Here’s what the job graph for the refresh of this semantic model looks like:

At the centre is the job “Process Partition XYUnion[XYUnion]” which refreshes the calculated table XYUnion. The arrows going from this job to the jobs “Process Partition X[X]” and “Process Partition Y[Y]”, which refresh tables X and Y, show that this job depends on those two other jobs.

Hovering over the “Process Partition XYUnion[XYUnion]” job shows the following popup:

There are four datetime values here: CreatedAt, RunnableAt, StartedAt and FinishedAt. There are also three durations:

  • Block is the elapsed time between CreatedAt and RunnableAt, and is the amount of time elapsed before all the jobs this job depends on were completed. Anything other than a zero here means that blocking has occurred.
  • Wait is the elapsed time between RunnableAt and StartedAt. A job becomes runnable when all the jobs it depends on have completed, but even then it still may not be able to start because Power BI only allows a certain number of jobs to refresh in parallel (see this post for more details on how to control the amount of parallelism). Waiting is described in the example below.
  • Run is the elapsed time between StartedAt and FinishedAt and is the amount of time the job itself took to run.

In this case you can see that the value for Block for XYUnion is 1 minute: X and Y have no preceding jobs so they kick off at the same time, X takes 1 minute to run and Y takes 10 seconds, so it is 1 minute before XYUnion can run. The popups for X and Y show 1 minute and 10 seconds respectively for Run, as you would expect:

One last thing to mention: in the full job graph diagram above you’ll see that certain nodes are highlighted in red. That’s because they are on the critical path, which is documented here; it’s the chain of jobs that dictates the overall length of the refresh, so if you want to make your refresh faster then you need to tune the jobs on the critical path. In this case the critical path goes through X to XYUnion: if you wanted the whole semantic model to refresh faster you would need to tune the refresh for either X or XYUnion; tuning the refresh for Y would make no difference to the overall semantic model refresh time.

Waiting

As I already mentioned, there is a limit on the number of jobs within a refresh that can run in parallel. The maximum number of jobs is represented by the number of “slots” – one slot can only run one job at a time – and in the screenshots of the popups above you can see each job has a slot number. If there are more jobs that could be run than there are slots available at a given time then some jobs have to wait for a slot.

Here’s another example: a semantic model with three tables, A, B and C, which each take 30 seconds to refresh.

There are no dependencies between the tables so in theory each of these three tables could refresh in parallel. However if you refresh with maxParallelism set to two then only two of the three can refresh at any one time. Here’s the job graph for a refresh that does that:

As you can see the critical path goes through the refresh jobs for table A, and hovering over the “Process Partition A[A]” job shows the following:

While this job was not blocked at all because there are no jobs it depends on, it had to wait 30 seconds for a slot to become available; it eventually ran in slot 0. Hovering over the nodes for “Process Partition B[B]” and “Process Partition C[C]” shows that they ran in slot 0 and slot 1 respectively and neither of them was blocked or had to wait.

The job graph isn’t always the best way of visualising this type of parallelism; Phil Seamark’s original Power BI report for visualising refreshes has a page which shows the slots and the jobs in them but I think there’s probably a better way of visualising all of this data that shows slots as well as dependencies. Maybe a Deneb visual is the answer? If anyone has ideas I’d be interested to hear them! In any case, the first step to doing this is to extract all of this data from the .DGML file into a table and that’s what I’ll demonstrate how to do in the next post in this series.

Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns

event_schema = {
    "JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break

            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop Trace and collect logs
        final_trace_logs = trace.stop()



# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the TextData column of the remaining events and concatenate it to get the contents of the DGML file, and then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
dgmlfile = open("/lakehouse/default/Files/" + request_status_id + ".dgml", 'x')
print (out, file=dgmlfile)
dgmlfile.close()

#dispose of trace connection
trace_connection.disconnect_and_dispose()

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, you can use OneLake Explorer to sync files in OneLake with your PC in a very similar way to OneDrive, which makes working with these DGML files in VS Code very straightforward because you can simply open the local copy when it syncs.

Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]

Nineteenth Blog Birthday: Fabric, The Hedgehog And The Fox

Every year, on the anniversary of the first-ever post on this blog, I write a post reflecting on what has happened to me professionally in the past year and this year the subject is obvious: Fabric. Of course I knew about, and indeed had been playing with, Fabric for a long time before even the private previews but it was only with the public preview and then GA that I felt the real impact on my job. I now work on the Fabric Customer Advisory Team at Microsoft rather than the Power BI Customer Advisory Team, for example, and a lot more than my job title has changed. Instead of helping customers with, and collecting feedback on, just Power BI I now have to know about the whole of Fabric – which is a challenge. But how much can anyone know about the whole of Fabric? Is it all too much for one person?

This is already a problem with Power BI of course. Back when Power BI first appeared I felt reasonably confident that I knew a fair amount about everything to do with it. Soon, though, I gave up on trying to know much about data visualisation and now when I see the amazing things people do with Deneb I’m full of admiration but I accept this is something I’ll never have time to learn. I think I’m better than average at DAX but I’ll never be anywhere near as good as Marco and Alberto. Even with Power Query and M my knowledge of creating custom connectors is limited. And that’s all OK: I know what I need to know to do my job and luckily I’m not the kind of person who feels overwhelmed by the constant flow of updates although I know a lot of people do (see this great article by Kurt Buhler on ways of dealing with this). If I don’t know everything about Power BI today then I can say confidently that I’ll never know everything about Fabric in the future. No-one can. You need a team. The question, then, is: how will job roles evolve within the Fabric community? Who will end up doing what, and how will we organise Fabric development?

There are two ways you can approach Fabric development, and the two ways remind me of Isaiah Berlin’s famous essay on “The Hedgehog and the Fox” (who says arts degrees are useless?). Unless you’re a fan of Tolstoy you don’t need to read it all; the essay turns on a line from the Greek poet Archilochus which says: “The fox knows many things, but the hedgehog knows one big thing” and explores how thinkers and writers, indeed all human beings, can be classified as either ‘foxes’ or ‘hedgehogs’. I think this classification can be useful when thinking about Fabric development too.

I’m definitely a hedgehog: I like to go deep rather than wide. Things were easier in the old days when the Microsoft BI stack was made of discrete products (Analysis Services, Integration Services, Reporting Services, SQL Server) and you could specialise in just one or two – the dividing lines were clear and pre-drawn and everyone working in BI had to be a hedgehog. It’s tempting to take the same approach with Fabric and encourage people to specialise in individual workloads, to become Fabric Spark or Fabric Warehouse professionals for example. It’s unsurprising that people from an IT background are natural hedgehogs but I also think this risks repeating the mistakes of the past: going back to the old days again, when I was an Analysis Services consultant I regularly saw teams who were focused on building their data warehouse in SQL Server but who never gave a second thought to what was meant to happen downstream, so the cubes, reports and spreadsheets suffered accordingly and so the solution as a whole was seen as a failure by the business. While it might seem as though a team of hedgehogs will give you a team of experts in all the areas you need, it can often result in a team of people focused on building specific parts of a solution rather than a functioning whole.

The alternative is to be a fox and I think most Power BI users who come from the business or are data analysts are foxes: they have to know a bit about everything in order to build an end-to-end solution because they work on their own and have to get stuff done. Self-service BI is inherently foxy. While a single Fabric fox can build a whole solution, though, and is more likely to deliver value to the business, the quality of the constituent parts may be variable depending on how well the developer knows each workload. A team of foxes won’t solve that problem either: two people who half-understand DAX aren’t equivalent to one person who is a DAX expert. Too much self-service development delivers short-term business value which is cancelled out in the long term under the weight of the technical debt incurred.

I don’t think you choose to be a fox or a hedgehog; your personality, learning style and job circumstances push you in one direction or the other. In some organisations there will only be foxes or only be hedgehogs, with all the attendant consequences. However, the fact that Fabric brings together everything you need for BI development in a single, tightly-integrated platform, and the fact that its functionality and price make it suitable for both self-service and enterprise BI development, mean that there’s an opportunity to combine hedgehogs and foxes in a new, more effective way.

The magic formula for Fabric must be similar to what the most successful Power BI organisations are already doing today: the hedgehogs in IT build the foundations, provide the core data, model it, and then the foxes in the business take these components and turn them into solutions. This will require a big change in the way the hedgehogs work and think: no more BI projects, no more building end-to-end. It should be rare for a hedgehog to ever build a report for the business. Instead the aim is to build a platform. Equally the foxes will have to change the way they work too: no more direct access to data sources, no more building end-to-end. It should be rare for a fox to ever bring external data into Fabric. Instead the aim is to build on the platform that has been provided by the hedgehogs. If the hedgehogs have done their job properly then the foxes will be happy – although that’s a big “if”. Finding the right organisational structures to bring hedgehogs and foxes together will be a challenge; it will be even more difficult to overcome the mutual distrust that has always existed between IT and the business.

So, to go back to the original question, who will do what in Fabric? On the hedgehog side there will need to be a platform architect and various specialists: in data ingestion, in Spark and/or Fabric Warehouse, in real-time analytics, and in designing core semantic models and DAX. On the fox side the required skills will be basic data transformation using Dataflows, basic semantic modelling/DAX using OneLake, and advanced report design and data visualisation. This seems to me like a division of labour that results in each team member having an area of Fabric they have a chance of learning and mastering, and one that will deliver the best outcomes.

Power Query Nested Data Types In Excel

A year ago support for nested data types in Excel was announced on the Excel blog, but the announcement didn’t have much detail about what nested data types are and the docs are quite vague too. I was recently asked how to create a nested data type and, while it turns out to be quite easy, I thought it would be good to write a short post showing how to do it.

Let’s say you have a Power Query query that returns details of the different types of fruit that you sell in your shop:

Let’s also say that the last three columns in this table (Grower, Address and Specialty) all relate to the farmer that supplies you with this fruit. Now you could create one Excel data type with all these columns in, but nested data types allow you to create data types inside data types, so in this case you can create a data type specifically for these three columns relating to the farmer and then nest it inside the main data type.

To do this, select these three columns in the Power Query Editor and click the Create Data Type button on the Transform tab in the ribbon:

Give the data type a name, in this case Farmer, in the Create Data Type dialog:

At this point you’ll have a query that returns a table where one of the columns, Farmer, contains a data type:

Finally, you select all the columns in this table, including the Farmer column, and click the Create Data Type button again to create another new data type, this time called Product:

Here’s what you’ll see in the Power Query Editor once you’ve done this:

And here’s the result in the Power Query Editor:

Once you’ve loaded this query to the worksheet you can explore the nested data type via the card popup:

Or you can access the data in the nested data type using a formula. For example, in the animated gif above, cell A2 contains the data type for Apples, so the formula

=A2.Farmer.Address

…returns the address of the farmer who grows apples.

Alternatively, you can use the Excel FIELDVALUE function to get the same result:

=FIELDVALUE(FIELDVALUE(A2, "Farmer"), "Address")