Performance Implications Of Using Calculated Columns In Composite Models On Power BI Semantic Models

I don’t have anything against the use of calculated columns in Power BI semantic models in general, but you do need to be careful using them with DirectQuery mode. In particular, when you have a DirectQuery connection to another Power BI semantic model – also known as a composite model on a Power BI semantic model – it’s very easy to cause serious performance problems with calculated columns. Let’s look at a simple example of why.

Let’s say you have an Import mode semantic model called Source Model containing sales data:

Here’s the contents of the Sales table:

And here’s the definition of the Sales Amount measure:

Sales Amount = SUM(Sales[Sales])

Now, let’s say this semantic model gets published to the Service and you open Power BI Desktop, create a Live connection to this model and then click the “Make changes to this model” button to create a composite model. You now have a local composite model in Power BI Desktop that is connected to the Source Model semantic model.

Next, you decide to create a calculated column on the Sales table and, after reading Marco and Alberto’s article on circular dependencies (because you’ll definitely run into circular dependency errors), come up with the following code:

Tax =
CALCULATE (
    [Sales Amount] * 0.1,
    ALLEXCEPT ( Sales, Sales[TransactionNumber] )
)

Great! You can now build a report that looks like this:

This is where the first warning comes in, though. In Import mode semantic models calculated columns are evaluated when the model is refreshed, but in DirectQuery mode they are evaluated at query time if a query needs to use them. Performance Analyzer shows two DAX queries generated for the visual above. The first is the query that the visual runs against the local semantic model in Power BI Desktop, which looks like this:

DEFINE
    VAR __DS0Core =
        SUMMARIZECOLUMNS(
            ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
            "Sales_Amount", 'Sales'[Sales Amount],
            "SumTax", CALCULATE(SUM('Sales'[Tax]))
        )

    VAR __DS0PrimaryWindowed =
        TOPN(502, __DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
__DS0PrimaryWindowed

ORDER BY
    [IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s nothing much interesting here. However, there’s a second DAX query generated: the one generated by the local model to get the data it needs from the Source Model semantic model in the Service. This query looks like this:

Define
    COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

    var ASDQ___DS0Core = SUMMARIZECOLUMNS(
        ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
        "Sales_Amount", [Sales Amount],
        "SumTax", CALCULATE(SUM('Sales'[ASDQ_Tax]))
    )

    var ASDQ___DS0PrimaryWindowed = TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
ASDQ___DS0PrimaryWindowed

ORDER BY
    [IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

Notice, right at the top of the query, a DEFINE statement that defines the Tax calculated column. Every time this visual is rendered in your Power BI report the calculated column will be re-evaluated and that will have a performance overhead. If the calculated column had been created in Source Model (which is Import mode remember) it would have been evaluated when the model was refreshed and any queries that used it would probably be faster as a result.

There’s something else to watch out for though. Let’s say you define a second calculated column in the local model like so:

Apple Sales =
CALCULATE (
    [Sales Amount],
    FILTER ( ALLNOBLANKROW ( 'Sales'[Product] ), 'Sales'[Product] = "Apples" )
)

You can now create a table that looks like this, which does include the Apple Sales column but does not include the Tax column:

The DAX query sent by the visual to the local model is not worth looking at. However, the DAX query sent by the local model to Source Model does have something surprising in it:

Define
    COLUMN 'Sales'[ASDQ_Apple Sales] = CALCULATE([Sales Amount], FILTER(ALLNOBLANKROW('Sales'[Product]), 'Sales'[Product]="Apples"))

    COLUMN 'Sales'[ASDQ_Tax] = CALCULATE([Sales Amount] * 0.1, ALLEXCEPT(Sales, Sales[TransactionNumber]))

    var ASDQ___DS0Core = SUMMARIZECOLUMNS(
        ROLLUPADDISSUBTOTAL(ROLLUPGROUP('Country'[Country], 'Product'[Product]), "IsGrandTotalRowTotal"),
        "Sales_Amount", [Sales Amount],
        "SumApple_Sales", CALCULATE(SUM('Sales'[ASDQ_Apple Sales]))
    )

    var ASDQ___DS0PrimaryWindowed = TOPN(502, ASDQ___DS0Core, [IsGrandTotalRowTotal], 0, 'Country'[Country], 1, 'Product'[Product], 1)

EVALUATE
ASDQ___DS0PrimaryWindowed

ORDER BY
    [IsGrandTotalRowTotal] DESC, 'Country'[Country], 'Product'[Product]

There’s a DEFINE statement for the Apple Sales column, as you would expect. However, there is also a DEFINE statement for the Tax column, which means it is evaluated too even though it’s not shown in the visual and it seems as though the Apple Sales column doesn’t reference it – or does it? I’m not going to try to explain what’s going on here (again, this is best left to Marco and Alberto) but the important point is that both calculated columns are now being evaluated at query time, which means the query will be even slower. If you have large data volumes and/or your calculated columns contain some complex DAX you could end up with some very slow reports. If you’re using Premium then you could also be putting a lot of extra load on your capacity.

Composite models are an important part of Power BI’s self-service story and it’s inevitable that at some point your developers will want to build calculated columns in them. As always, you need to monitor the performance of your reports carefully, both in terms of duration and CPU usage, while you are building them to make sure they are well optimised – and keep the use of calculated columns in composite models to a minimum.

Understanding The “Evaluation resulted in a stack overflow” Error In Power Query In Excel And Power BI

If you’re writing your own M code in Power Query in Power BI or Excel you may run into the following error:

Expression.Error: Evaluation resulted in a stack overflow and cannot continue.

If you’re a programmer you probably know what a stack overflow is; if you’re not you might search for the term, find this Wikipedia article and still have no clue what has happened. Either way it may still be difficult to understand why you’re running into it in Power Query. To explain let’s see some examples.

First, a really simple one. You can write recursive functions – functions that call themselves – in Power Query, although as you might suspect that is generally a bad idea because they are difficult to write, difficult to understand, slow and may result in the error above. Consider the following function called MyRecursiveFunction which references a parameter of type number called MaxDepth:

(counter as number) as number =>
    if counter > MaxDepth then counter else @MyRecursiveFunction(counter + 1)

The function takes a number and calls itself, passing in a value one greater than the number that was passed to it, until the number passed is greater than MaxDepth. So, if the value of MaxDepth is 3 and you call the function and pass it the value of 1, like so:

MyRecursiveFunction(1)

…then you’ll get the value 4 back:

So far so good. But how long can a function in M go on calling itself? As you can imagine, not forever. When you hit the point where the function can’t go on calling itself, you get the error above. For example, if you set MaxDepth to 100000 then you’ll get the stack overflow error instead of the value 100001:

As a result it’s almost always a good idea to avoid recursion in Power Query and use functions like List.Transform, List.Generate or List.Accumulate to achieve the same result. A great example of this is shown in the Power Query custom connector documentation in the section on handling APIs that return results broken up into pages with the Table.GenerateByPage code sample.
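For example, here’s one way – a minimal sketch of my own, not code from that documentation – that the counting logic of MyRecursiveFunction above could be rewritten with List.Generate so that no function ever calls itself:

let
    MaxDepth = 3,
    //generate the sequence of counter values, starting at 1,
    //and stopping once MaxDepth has been exceeded
    Steps = List.Generate(
        () => 1,
        each _ <= MaxDepth + 1,
        each _ + 1
    ),
    //the last value generated is the same value the recursive function returned
    Result = List.Last(Steps)
in
    Result

In a trivial case like this a simple list expression would do the same job, but List.Generate scales to more complex stop conditions without any risk of a stack overflow.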

You may still get this error even when you’re not explicitly using recursion though, as a result of lazy evaluation. Consider the following query which uses List.Accumulate to generate a table with a given number of rows:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 3,
    //Use List.Accumulate to create a table with this number of rows
    //By calling Table.Combine
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
    )
in
    CombineAllTables

Here’s the output, a table with three rows:

But how does it get this result? With NumberOfTimes=3 you can think of this query lazily building up an M expression something like this:

Table.Combine({Table.Combine({MyTable, MyTable}), MyTable})

…which, once List.Accumulate has finished, all has to be evaluated at once and turned into a single table. Imagine how much nesting of Table.Combine there would be if NumberOfTimes was a much larger number though! And indeed, it turns out that you can’t make lots and lots of nested calls to Table.Combine without running into a stack overflow. So if NumberOfTimes is set to 100000, like so:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 100000,
    //Use List.Accumulate to create a table with this number of rows
    //by calling Table.Combine
    CombineAllTables = List.Accumulate(
        {1 .. NumberOfTimes},
        null,
        (state, current) => if current = 1 then MyTable else Table.Combine({state, MyTable})
    )
in
    CombineAllTables

…then, after a minute or so, you get the “Evaluation resulted in a stack overflow and cannot continue” error again.

Rewriting the query so you build up the list of tables first and only call Table.Combine once at the end avoids the problem and is much faster:

let
    //define a table with one row and one column called x
    MyTable = #table(type table [x = number], {{1}}),
    //specify how many rows we want in our output table
    NumberOfTimes = 100000,
    //create a table with NumberOfTimes rows
    CombineAllTables = Table.Combine(List.Repeat({MyTable}, NumberOfTimes))
in
    CombineAllTables

It’s also possible to solve the problem by forcing eager evaluation inside List.Accumulate but this is extremely tricky: there’s an example of this on Gil Raviv’s blog here.

New Limits For The “Maximum Connections Per Data Source” Property In Power BI DirectQuery Mode

One of the most important properties you can set in a Power BI DirectQuery semantic model is the “Maximum connections per data source” property, which controls the number of connections that can be used to run queries against a data source. The good news is that the maximum value that you can set this property to has just been increased in Premium.

This property is important because the number of open connections to a data source acts as a limit on the number of concurrent queries that can be run by the semantic model against the source: each connection can only have one query running on it at any one time. If you have Power BI reports that have a large number of visuals on a page and/or a large number of users running reports at the same time then it is very easy for a DirectQuery semantic model to need to send lots of queries back to your data source at the same time. If some of the queries that your semantic model runs against your data source are slow – more than one or two seconds even – then the number of queries that need to be run at a given time will increase. The same is true if you have increased the Max Parallelism Per Query property to increase the number of parallel queries that can be generated by a single DAX query.
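To take a simplified example: if opening a report page generates 30 DirectQuery queries and the limit is 10 connections, only 10 of those queries can run against the source at any one time, so if each source query takes 2 seconds the last batch of 10 queries won’t even start until 4 seconds after the page is opened.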

This property is documented in a number of places, including the DirectQuery guidance documentation and in data source-specific best practice documents such as this one for Snowflake. You can set the property in Power BI Desktop in the Current File/DirectQuery section of the Options dialog:

If you are not using Power BI Premium (ie you are using Power BI Shared capacity, also known as Power BI Pro) then the maximum value that you can set this property to is 10. If you are using Power BI Premium then the maximum value up to today was 30 but now that limit has been increased. The table on this page shows what the new limits per SKU are:

As you can see, for a P1/F64 the maximum limit is now 50 rather than 30 and this limit goes all the way up to 200 for a P4/F512 and higher.

I’ve seen plenty of cases where increasing the value of this property makes Power BI reports run a lot faster. However, this will only be true if your data source is able to handle the number of queries that Power BI is trying to run against it. As I showed in this post, if your data source can’t handle the number of queries you’re trying to run then performance will get worse and not better, so you should try different values to see which one works best.

Extracting Power BI Import Mode Job Graph Data To A Table

In my last post but one I showed how you could create a DGML file that contains detailed information on what happens during a Power BI Import mode refresh – the “job graph” – and visualise that in Visual Studio or VS Code, to help with performance tuning. In my last post, I explained the concepts of blocking and waiting which are key to understanding this data. In this post I’ll share and describe a Dataflow template that extracts the data contained in the DGML file to a table for other types of analysis – which turns out to be quite simple, because DGML is an XML-based format which is easy to work with in Power Query and Dataflows.
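To illustrate just how easy – and this is only a minimal sketch of my own to show the idea, not what the template itself does – here’s how the Node elements in a DGML file, which represent the individual refresh jobs, can be read with the Xml.Document function from a local copy of the file (the path used here is just an example):

let
    //a DGML file is just XML, so it can be read with Xml.Document
    Source = Xml.Document(File.Contents("C:\Temp\MyRefreshId.dgml")),
    //navigate to the children of the root DirectedGraph element
    DirectedGraph = Source{0}[Value],
    //get the table of Node elements, one per refresh job
    Nodes = DirectedGraph{[Name = "Nodes"]}[Value],
    //each row's nested Attributes table contains the job's Id, timings and so on,
    //and can be expanded from here
    Result = Nodes
in
    Result

The template does the equivalent, reading the file from the workspace and lakehouse you specify in its parameters and then expanding this data into the tables described below.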

You can download the .pqt template file here (for more information on how to use templates see here). To use it, create a new Dataflow Gen2 and import the template file. You’ll see the following queries:

Next, change the values of the three parameters to contain the ID of the refresh whose DGML file you want to load (the notebook in this post that creates the DGML file uses the refresh ID as the file’s name) and the IDs of the workspace and lakehouse where that file is stored. Finally, set output destinations on the RefreshJobs and/or RefreshJobsPivoted queries. The RefreshJobs query returns a table with one row per refresh job:

This gives you all the data in the DGML file that was visible using a DGML viewer but in table form. For each job you get the CreatedAt, RunnableAt, StartedAt and FinishedAt datetimes, the Blocked, Wait and Running durations, the Slot ID, the CPU used and whether the job is on the critical path.

The RefreshJobsPivoted query gives you exactly the same data but, instead of one row per job, you get three: one row each for the blocked, waiting and running phases of the job, with the start and end times for each phase and its duration. This may be a more convenient format for visualisation and analysis:

The Links query gives you the dependencies between the jobs:

Having loaded all this data into a Lakehouse you can now build a report on it. As I said, I’m sure there’s a way of visualising all of this data in a way that shows all the durations and dependencies, but even a basic table report on the RefreshJobs table like this is really useful because it shows you which jobs were blocked, which ones had to wait, and which ones took a long time to run:

In this example (the Dependencies example from my previous post, but where the refresh has maxParallelism set to 1, so there is blocking as well as waiting) you can see that refreshing table X took 60 seconds, that the refresh for table Y had to wait for 60 seconds before it could start and took 10 seconds, and that refresh for table XY was blocked for 70 seconds before it could start. So, if you wanted to make this refresh run faster, you would want to understand why table X was so slow and also look at increasing the amount of parallelism so Y did not have to wait.

Understanding Blocking And Waiting In Power BI Import Mode Refreshes

Following on from my previous post showing how you can visualise the job graph for a Power BI Import mode semantic model refresh, in this post I will look at how you can interpret what the job graph tells you – specifically, explaining the concepts of blocking and waiting. As always, simple examples are the best way of doing this.

Blocking

Blocking occurs when one job can only start when one or more other jobs have completed because there is a dependency between them. Consider the following semantic model:

X and Y are tables that contain a single numeric column. X takes 1 minute to load while Y takes 10 seconds to load (I forced these delays using a technique I blogged about here). XYUnion is a DAX calculated table that unions all the rows from X and Y with the following definition:

XYUnion = UNION(X,Y)

As you can imagine, the job that refreshes XYUnion can only start once the jobs that refresh both X and Y have finished; XYUnion will be blocked until both X and Y have refreshed. Here’s what the job graph for the refresh of this semantic model looks like:

At the centre is the job “Process Partition XYUnion[XYUnion]” which refreshes the calculated table XYUnion. The arrows going from this job to the jobs “Process Partition X[X]” and “Process Partition Y[Y]”, which refresh tables X and Y, show that this job depends on those two other jobs.

Hovering over the “Process Partition XYUnion[XYUnion]” job shows the following popup:

There are four datetime values here: CreatedAt, RunnableAt, StartedAt and FinishedAt. There are also three durations:

  • Block is the elapsed time between CreatedAt and RunnableAt: the amount of time that passed before all the jobs this job depends on had completed. Anything other than zero here means that blocking has occurred.
  • Wait is the elapsed time between RunnableAt and StartedAt. A job becomes runnable when all the jobs it depends on have completed, but even then it may not be able to start because Power BI only allows a certain number of jobs to refresh in parallel (see this post for more details and how to control the amount of parallelism). Waiting is described in the example below.
  • Run is the elapsed time between StartedAt and FinishedAt: the amount of time the job itself took to run.

In this case you can see that the value for Block for XYUnion is 1 minute: X and Y have no preceding jobs so they kick off at the same time, X takes 1 minute to run and Y takes 10 seconds, so it is 1 minute before XYUnion can run. The popups for X and Y show 1 minute and 10 seconds respectively for Run, as you would expect:

One last thing to mention: in the full job graph diagram above you’ll see that certain nodes are highlighted in red. That’s because they are on the critical path, which is documented here; it’s the chain of jobs that dictates the overall length of the refresh, so if you want to make your refresh faster then you need to tune the jobs on the critical path. In this case the critical path goes through X to XYUnion: if you wanted the whole semantic model to refresh faster you would need to tune the refresh for either X or XYUnion; tuning the refresh for Y would make no difference to the overall semantic model refresh time.

Waiting

As I already mentioned, there is a limit on the number of jobs within a refresh that can run in parallel. The maximum number of jobs is represented by the number of “slots” – one slot can only run one job at a time – and in the screenshots of the popups above you can see each job has a slot number. If there are more jobs that could be run than there are slots available at a given time then some jobs have to wait for a slot.

Here’s another example: a semantic model with three tables, A, B and C, which each take 30 seconds to refresh.

There are no dependencies between the tables so in theory each of these three tables could refresh in parallel. However if you refresh with maxParallelism set to two then only two of the three can refresh at any one time. Here’s the job graph for a refresh that does that:

As you can see the critical path goes through the refresh jobs for table A, and hovering over the “Process Partition A[A]” job shows the following:

While this job was not blocked at all, because there are no jobs it depends on, it had to wait 30 seconds for a slot to become available; it eventually ran in slot 0. Hovering over the nodes for “Process Partition B[B]” and “Process Partition C[C]” shows that they ran in slot 0 and slot 1 respectively and neither of them was blocked or had to wait.

The job graph isn’t always the best way of visualising this type of parallelism; Phil Seamark’s original Power BI report for visualising refreshes has a page which shows the slots and the jobs in them but I think there’s probably a better way of visualising all of this data that shows slots as well as dependencies. Maybe a Deneb visual is the answer? If anyone has ideas I’d be interested to hear them! In any case, the first step to doing this is to extract all of this data from the .DGML file into a table and that’s what I’ll demonstrate how to do in the next post in this series.

Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns
event_schema = {
    "JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName, WorkspaceName) as trace_connection:
    # create trace on server with specified events
    with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

        trace.start()

        # run the refresh
        request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
        print("Progress:", end="")

        while True:
            status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
            if status == "Completed":
                break

            print("░", end="")
            time.sleep(2)

        print(": refresh complete")
        # allow ending events to collect
        time.sleep(5)

        # stop Trace and collect logs
        final_trace_logs = trace.stop()

# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the TextData column of the remaining events and concatenate it to get the contents of the DGML file, then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in the TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
dgmlfile = open("/lakehouse/default/Files/" + request_status_id + ".dgml", 'x')
print(out, file=dgmlfile)
dgmlfile.close()

# dispose of trace connection
trace_connection.disconnect_and_dispose()

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, you can use OneLake Explorer to sync files in OneLake with your PC in a very similar way to OneDrive, which makes working with these DGML files in VS Code very straightforward because you can simply open the local copy once it syncs.

Here’s what one of these DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGMLViewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]

Incremental Refresh On Delta Tables In Power BI

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However, not everyone is ready for Fabric yet, so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lakehouse and your import mode Power BI semantic models, then this new function could allow you to remove it, reducing complexity and cutting costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible using this connector because there’s no database to query: a Delta table is just a folder with some files in it. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table by the datetime column required for incremental refresh, that query is significantly faster than reading all the data from the table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.

There are three factors that influence the performance of Power Query when querying a Delta table:

  1. The internal structure of the Delta table, in particular whether it is partitioned or not
  2. The implementation of the connector, ie the DeltaLake.Table function
  3. The M code you write in the queries used to populate the tables in your semantic model

There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.

Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"

Here’s the incremental refresh dialog:

I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:

The refresh took just over 30 minutes.

Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column that contains the pickup date with the time removed, and I then partitioned the table by that new column (called PickupDate here):

CREATE TABLE PartitionedByDateNYC
USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw

I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    #"Filtered Rows" = Table.SelectRows(
        ToDelta,
        each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
    )
in
    #"Filtered Rows"

…and refreshed again. This time the refresh took about 26 seconds.

Going from half an hour to 26 seconds is a big improvement, and it’s because the DeltaLake.Table function is able to perform partition elimination: the partitions in the semantic model align to one or more partitions in the Delta table, so when each partition in the semantic model is refreshed Power Query only needs to read data from the partitions in the Delta table that contain the relevant data. This only happens because the filter in the Power Query query using the RangeStart and RangeEnd parameters is on the same column that is used to partition the Delta table.

In my final test I partitioned my Delta table by month, like so:

CREATE TABLE PartitionedNYC
USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw

The challenge here is that:

  1. The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
  2. Power BI incremental refresh creates partitions at the year, quarter, month and date granularities, so filtering by month can’t be used for date partitions

I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:

let
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
        [HierarchicalNavigation = true]
    ),
    ToDelta = DeltaLake.Table(Source),
    YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart),
    YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd),
    FilterByPartition = Table.StopFolding(
        Table.SelectRows(
            ToDelta,
            each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd
        )
    ),
    #"Filtered Rows" = Table.SelectRows(
        FilterByPartition,
        each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
    )
in
    #"Filtered Rows"

Interestingly this table refreshed even faster: it took only 18 seconds.

This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).
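If you do want to experiment with those options, the kind of pattern I have in mind is shown below. Note that this is only a sketch: the BlockSize option name and the 8MB value are assumptions on my part based on that earlier testing, so check the connector documentation before relying on them:

let
    //pass non-default options to AzureStorage.DataLake
    //(the BlockSize option and the 8MB value are assumptions - check the documentation)
    Source = AzureStorage.DataLake(
        "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
        [HierarchicalNavigation = true, BlockSize = 8388608]
    ),
    ToDelta = DeltaLake.Table(Source)
in
    ToDelta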

To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.

Getting Different Versions Of Data With Value.Versions In Power Query

Something I mentioned in my recent post on the Fabric blog about the new DeltaLake.Table M function was the fact that you can get different versions of the data held in a Delta table using the Value.Versions M function. In fact, Value.Versions is the way to access different versions of data in any source that has this concept – so long as Power Query has added support for doing so. The bad news is that, at least at the time of writing, apart from the DeltaLake connector there’s only one other source where Value.Versions can be used in this way: the connector for Fabric Lakehouses.

Here’s how you can access the data in a table using the Lakehouse.Contents M function:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data]
in
    SelectTable

As with DeltaLake.Table, you can get a table with all the different versions available using Value.Versions:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    ShowVersions = Value.Versions(SelectTable)
in
    ShowVersions

Version 0 is the earliest version; the latest version of the data is the version with the highest number and this version can also be accessed from the row with the version number null. The nested values in the Data column are tables which give you the data for that particular version number. So, for example, if I wanted to get the data for version 2 I could click through on the nested value in the Data column in the row where the Version column contained the value 2. Here’s the M code for this:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    ShowVersions = Value.Versions(SelectTable),
    Data = ShowVersions{2}[Data]
in
    Data

The Lakehouse connector uses the TDS Endpoint of the Lakehouse to get data by default, as in the first code snippet above, but if you use Value.Versions to get a specific version this isn’t (as yet) possible, so a slower method is used to get the data and performance may suffer.

Last of all, you can get the version number of the data you’re looking at using the Value.VersionIdentity function. If you’re looking at the latest version of the data then Value.VersionIdentity will return null:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    GetVersion = Value.VersionIdentity(SelectTable)
in
    GetVersion

If you are looking at version 2 of the data then Value.VersionIdentity will return 2:

let
    Source = Lakehouse.Contents(),
    SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
    SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
    SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
    ShowVersions = Value.Versions(SelectTable),
    GetVersion2 = ShowVersions{2}[Data],
    GetVersionNumber = Value.VersionIdentity(GetVersion2)
in
    GetVersionNumber

Anti Joins And Query Folding In Power Query In Excel And Power BI

Power Query allows you to merge (“join” in database terms) two tables together in a variety of different ways, including left and right anti joins. Unfortunately, as I found recently, anti joins don’t fold on SQL Server-related data sources, which can result in performance problems. Luckily there is a different way of doing anti joins that does fold.

Say you have two Power Query queries called Fruit1 and Fruit2 that get data from SQL Server tables containing the names of different varieties of fruit:

Now, let’s say you want to get a list of all the fruit varieties that are in Fruit1 and not in Fruit2. The obvious thing to do is to do a Merge and use the Left Anti option like so:

Here’s the M code, including an extra step to remove the join column that this creates:

let
    Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftAnti),
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Fruit"})
in
    #"Removed Other Columns"

This gives you the correct result:

…but it doesn’t fold, so performance may be bad.

However, if you do a Merge and use the Left Outer option instead:

Then expand the join column (called Fruit2.Fruit here):

And then filter on that column so you only keep the rows where it contains the value null, and finally remove that column, you get the same result:

Here’s the M:

let
    Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftOuter),
    #"Expanded Fruit2" = Table.ExpandTableColumn(Source, "Fruit2", {"Fruit"}, {"Fruit2.Fruit"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Fruit2", each ([Fruit2.Fruit] = null)),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Fruit"})
in
    #"Removed Other Columns"

This now does fold (meaning performance should be better) and gives you the following SQL:

select [_].[Fruit]
from 
(
    select [$Outer].[Fruit],
        [$Inner].[Fruit2]
    from [dbo].[Fruit1] as [$Outer]
    left outer join 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner] on ([$Outer].[Fruit] = [$Inner].[Fruit2] or [$Outer].[Fruit] is null and [$Inner].[Fruit2] is null)
) as [_]
where [_].[Fruit2] is null

IF, SWITCH And The Effect Of DAX Variables On Strict/Eager Evaluation

A lot of DAX performance problems relate to the use of strict and eager evaluation with the IF or SWITCH functions. It’s an incredibly complex, “it depends”, black-box type of topic and naturally Marco and Alberto have a great article on it here. Rather than describe any general rules – which would be pretty much impossible – in this blog post I want to show a specific scenario that illustrates how the use of DAX variables can influence whether strict or eager evaluation takes place.

Consider a Power BI semantic model that has the following three tables:

TableA and TableB are fact tables with numeric measure columns; Choice is a disconnected table containing text values that is intended for use in a slicer, so a report user can select an item in that slicer and that in turn influences what values a measure returns:

Here’s the definition of one such measure:

With Variables =
VAR a =
    SUM ( TableA[A] )
VAR b =
    SUM ( TableB[B] )
RETURN
    IF ( SELECTEDVALUE ( Choice[Choice] ) = "TableA", a, b )

And here’s my test report that includes a slicer and a card visual displaying the output of this measure:

Let’s look at the DAX query generated by the card visual containing the measure when “TableA” is selected in the slicer:

DEFINE
    VAR __DS0FilterTable =
        TREATAS ( { "TableA" }, 'Choice'[Choice] )

EVALUATE
SUMMARIZECOLUMNS (
    __DS0FilterTable,
    "With_Variables", IGNORE ( 'Choice'[With Variables] )
)

…and in particular what DAX Studio’s Server Timings feature shows us:

Even though only TableA is selected in the slicer and the query only returns the sum of the values in the A column of TableA, we can see that the Storage Engine is also querying TableB and getting the sum of the B column. It’s a great example of eager evaluation: both branches of the IF are being evaluated. Is this a bad thing? For this particular report it may be if the Storage Engine query for TableB is expensive.

How can you force strict evaluation to take place? You can force eager evaluation using the IF.EAGER function but there is no equivalent function to force strict evaluation. However, you may be able to rewrite the measure to get strict evaluation to take place.

The key factor in this case is the use of variables in the measure definition. If you rewrite the measure to not use variables, like so:

No Variables =
IF (
    SELECTEDVALUE ( Choice[Choice] ) = "TableA",
    SUM ( TableA[A] ),
    SUM ( TableB[B] )
)

…then the query for the card visual behaves differently:

DEFINE
    VAR __DS0FilterTable =
        TREATAS ( { "TableA" }, 'Choice'[Choice] )

EVALUATE
SUMMARIZECOLUMNS (
    __DS0FilterTable,
    "No_Variables", IGNORE ( 'Choice'[No Variables] )
)

Notice that there are now only two Storage Engine queries and that TableB is no longer being queried, which will probably result in better performance. This is strict evaluation.

Why does the use of variables result in eager evaluation here? Because it does, and it’s complicated. I need to stress that DAX uses lazy evaluation for variables, which means that variables are not evaluated if they are not used – in the first measure above the IF is deliberately evaluating both branches. There are certainly other optimisations that may kick in and result in strict evaluation even when variables are used in IF/SWITCH. Indeed, the fact that the two branches in this example sum columns from different fact tables is extremely important: if they had come from the same fact table then the behaviour would have been different and strict evaluation would have been used. In summary, though, if you are using variables and you want to try to force strict evaluation with IF/SWITCH then it’s worth rewriting your code to remove those variables to see if it makes a difference.

I also need to stress that these two measures will perform better or worse depending on how and where they are used. Consider these two table visuals that use the values from the Choice table on rows along with the two measures above:

Running the DAX query for the table on the left, which uses the measure with no variables and strict evaluation, gives the following in DAX Studio’s Server Timings:

Note that there are two Storage Engine queries for TableB, which is not a good thing.

On the other hand, the table on the right, which uses the [With Variables] measure, gives the following:

Note that there is only one Storage Engine query for TableB now, so in this case the tables have turned and the [With Variables] measure is likely to perform better.

When you use variables in the branches of IF/SWITCH, as in the [With Variables] measure, the variables are evaluated in the context of the place where they are defined; if you don’t use variables in this way, as in the [No Variables] measure, the two SUM expressions used in the two branches are evaluated in the context of their branch, which adds a hidden filter context corresponding to the branch’s condition. This has two consequences which may hurt query performance:

  1. During evaluation, the hidden filter may be materialised into a large in-memory table
  2. Fusion cannot happen across common subexpressions in different branches because they have different contexts

In contrast, the use of variables means the expressions used in them are evaluated in a context without the hidden filter corresponding to the branch, which can encourage fusion and discourage the materialisation of large in-memory tables.

[Thanks to Marius Dumitru and Jeffrey Wang for much of the information in this post]