New Limits For The “Maximum Connections Per Data Source” Property In Power BI DirectQuery Mode

One of the most important properties you can set in a Power BI DirectQuery semantic model is the “Maximum connections per data source” property, which controls the number of connections that can be used to run queries against a data source. The good news is that the maximum value that you can set this property to has just been increased in Premium.

This property is important because the number of open connections to a data source acts as a limit on the number of concurrent queries that can be run by the semantic model against the source: each connection can only have one query running on it at any one time. If you have Power BI reports that have a large number of visuals on a page and/or a large number of users running reports at the same time then it is very easy for a DirectQuery semantic model to need to send lots of queries back to your data source at the same time. If some of the queries that your semantic model runs against your data source are slow – more than one or two seconds even – then the number of queries that need to be run at a given time will increase. The same is true if you have increased the Max Parallelism Per Query property to increase the number of parallel queries that can be generated by a single DAX query.

This property is documented in a number of places, including the DirectQuery guidance documentation and in data source-specific best practice documents such as this one for Snowflake. You can set the property in Power BI Desktop in the Current File/DirectQuery section of the Options dialog:

If you are not using Power BI Premium (ie you are using Power BI Shared capacity, also known as Power BI Pro) then the maximum value that you can set this property to is 10. If you are using Power BI Premium then the maximum value up to today was 30 but now that limit has been increased. The table on this page shows what the new limits per SKU are:

As you can see, for a P1/F64 the maximum limit is now 50 rather than 30 and this limit goes all the way up to 200 for a P4/F512 and higher.

I’ve seen plenty of cases where increasing the value of this property makes Power BI reports run a lot faster. However, this will only be true if your data source is able to handle the number of queries that Power BI is trying to run against it. As I showed in this post, if your data source can’t handle the number of queries you’re trying to run then performance will get worse and not better, so you should try different values to see which one works best.

Extracting Power BI Import Mode Job Graph Data To A Table

In my last post but one I showed how you could create a DGML file that contains detailed information on what happens during a Power BI Import mode refresh – the “job graph” – and visualise that in Visual Studio or VS Code, to help with performance tuning. In my last post, I explained the concepts of blocking and waiting which are key to understanding this data. In this post I’ll share and describe a Dataflow template that extracts the data contained in the DGML file to a table for other types of analysis – which turns out to be quite simple, because DGML is an XML-based format which is easy to work with in Power Query and Dataflows.

You can download the .pqt template file here (for more information on how to use templates see here). To use it, create a new Dataflow Gen2 and import the template file. You’ll see the following queries:

Next, change the values of the three parameters to contain the IDs of the refresh whose DGML file you want to load (the notebook in this post that creates the DGML file uses the refresh ID as the file’s name) and the workspace and lakehouse where that file is stored. Finally, set output destinations on either the RefreshJobs and/or RefreshJobsPivoted queries. The RefreshJobs query returns a table with one row per refresh job:

This gives you all the data in the DGML file that was visible using a DGML viewer but in table form. For each job you get the CreatedAt, RunnableAt, StartedAt and FinishedAt datetimes, the Blocked, Wait and Running durations, the Slot ID, the CPU used and whether the job is on the critical path.

The RefreshJobsPivoted query gives you exactly the same data but instead of giving you one row per job, you get three: one row for the blocked, waiting and running phases of each job, with the start and end times for each phase and the duration. This may be a more convenient format for visualisation and analysis:

The Links query gives you the dependencies between the jobs:

Having loaded all this data into a Lakehouse you can now build a report on it. As I said, I’m sure there’s a way of visualising all of this data in a way that shows all the durations and dependencies, but even a basic table report on the RefreshJobs table like this is really useful because it shows you which jobs were blocked, which ones had to wait, and which ones took a long time to run:

In this example (the Dependencies example from my previous post, but where the refresh has maxParallelism set to 1, so there is blocking as well as waiting) you can see that refreshing table X took 60 seconds, that the refresh for table Y had to wait for 60 seconds before it could start and took 10 seconds, and that refresh for table XY was blocked for 70 seconds before it could start. So, if you wanted to make this refresh run faster, you would want to understand why table X was so slow and also look at increasing the amount of parallelism so Y did not have to wait.

Understanding Blocking And Waiting In Power BI Import Mode Refreshes

Following on from my previous post showing how you can visualise the job graph for a Power BI Import mode semantic model refresh, I this post I will look at how you can interpret what the job graph tells you – specifically, explaining the concepts of blocking and waiting. As always, simple examples are the best way of doing this.

Blocking

Blocking occurs when one job can only start when one or more other jobs have completed because there is a dependency between them. Consider the following semantic model:

X and Y are tables that contain a single numeric column. X takes 1 minute to load while Y takes 10 seconds to load (I forced this delay using technique I blogged about here). XYUnion is a DAX calculated table that unions all the rows from X and Y with the following definition:

XYUnion = UNION(X,Y)

As you can imagine, the job that refreshes XYUnion can only start once the jobs that refresh both X and Y have finished; XYUnion will be blocked until both X and Y have refreshed. Here’s what the job graph for the refresh of this semantic model looks like:

At the centre is the job “Process Partition XYUnion[XYUnion]” which refreshes the calculated table XYUnion. The arrows going from this job to the jobs “Process Partition X[X]” and “Process Partition Y[Y]”, which refresh tables X and Y, show that this job depends on those two other jobs.

Hovering over the “Process Partition XYUnion[XYUnion” job shows the following popup:

There are four datetime values here: CreatedAt, RunnableAt, StartedAt and FinishedAt. There are also three durations:

  • Block is the elapsed time between CreatedAt and RunnableAt, and is the amount of time elapsed before all the jobs this job depends on were completed. Anything other than a zero here means that blocking has occurred.
  • Wait is the elapsed time between RunnableAt and StartedAt. A job becomes runnable when all the jobs it depends on have completed but even it still may not be able to start because Power BI only allows a certain number of jobs to refresh in parallel (see this post for more details and how to control the amount of parallelism). Waiting is described in the example below.
  • Run is the elapsed time between StartedAt and FinishedAt and is the amount of time the job itself took to run.

In this case you can see that the value for Block for XYUnion is 1 minute: X and Y have no preceding jobs so they kick off at the same time, X takes 1 minute to run and Y takes 10 seconds, so it is 1 minute before XYUnion can run. The popups for X and Y show 1 minute and 10 seconds respectively for Run, as you would expect:

One last thing to mention: in the full job graph diagram above you’ll see that certain nodes are highlighted in red. That’s because they are on the critical path, which is documented here; it’s the chain of jobs that dictates the overall length of the refresh, so if you want to make your refresh faster then you need to tune the jobs on the critical path. In this case the critical path goes through X to XYUnion: if you wanted the whole semantic model to refresh faster you would need to tune the refresh for either X or XYUnion; tuning the refresh for Y would make no difference to the overall semantic model refresh time.

Waiting

As I already mentioned, there is a limit on the number of jobs within a refresh that can run in parallel. The maximum number of jobs is represented by the number of “slots” – one slot can only run one job at a time – and in the screenshots of the popups above you can see each job has a slot number. If there are more jobs that could be run than there are slots available at a given time then some jobs have to wait for a slot.

Here’s another example: a semantic model with three tables, A, B and C, which each take 30 seconds to refresh.

There are no dependencies between the tables so in theory each of these three tables could refresh in parallel. However if you refresh with maxParallelism set to two then only two of the three can refresh at any one time. Here’s the job graph for a refresh that does that:

As you can see the critical path goes through the refresh jobs for table A, and hovering over the “Process Partition A[A]” job shows the following:

While this job was not blocked at all because there are no jobs it depends on, it had to wait 30 seconds for a slot to become available; it eventually ran in slot 0. Hovering over the nodes for “Process Partition B[B]” and “Process Partition C[C]” shows that they ran in slots 0 and slot 1 respectively and neither of them were blocked or had to wait.

The job graph isn’t always the best way of visualising this type of parallelism; Phil Seamark’s original Power BI report for visualising refreshes has a page which shows the slots and the jobs in them but I think there’s probably a better way of visualising all of this data that shows slots as well as dependencies. Maybe a Deneb visual is the answer? If anyone has ideas I’d be interested to hear them! In any case, the first step to doing this is to extract all of this data from the .DGML file into a table and that’s what I’ll demonstrate how to do in the next post in this series.

Visualising Power BI Import Mode Refresh Job Graphs

A few years ago a new pair of Profiler events was added for Power BI Import mode datasets (and indeed AAS models): the Job Graph events. I blogged about them here but they never got used by anyone because it was extremely difficult to extract useful data from them – you had to run a Profiler trace, save the trace file, run a Python script to generate a .dgml file, then open that file in Visual Studio – which was a shame because they contain a lot of really interesting, useful information. The good news is that with the release of Semantic Link in Fabric and the ability to run Profiler traces from a Fabric notebook it’s now much easier to access Job Graph data and in this blog post I’ll show you how.

Quick recap: what are the Job Graph events and why are they useful? Let’s say you have a Power BI Import mode semantic model and you want to optimise refresh performance. When you refresh a semantic model, that refresh is made up of multiple jobs which themselves are made up of multiple jobs: refreshing a semantic model involves refreshing all the tables in that model, refreshing a table involves refreshing all the partitions in that table, refreshing a partition involves loading the data and building attribute hierarchies, and so on. Some of these jobs can happen in parallel but in some cases there are dependencies between jobs, so one job can only start when another has completed. The Job Graph events give you information on these refresh jobs and the dependencies between them so you can work out which jobs you need to optimise. In order to capture information from them you need to run a trace while the semantic model is being refreshed; the data from some of these Job Graph events can be reconstituted into a Directed Graph Markup Language (DGML) file, which is an XML-based format, and once you’ve got that you can either visualise the DGML file using a suitable viewer or extract the data from it and analyse it further.

[Before I carry on I have to acknowledge that I’m extremely new at Python and a lot of the code in this post is adapted from the code in my colleague Phil Seamark’s excellent recent post on visualising Power BI refresh information with Semantic Link. Any feedback on ways to optimise the code is gratefully received.]

Here’s some Python code that you can use in a Fabric notebook to run a refresh and generate a DGML file. Each code snippet can be used in a separate code cell or combined into a single cell.

First of all you need to install Semantic Link:

%pip install semantic-link

Next you need to define the events you want in your trace, which in this case are just the Job Graph events:

import sempy.fabric as fabric
import pandas as pd
import time
import warnings

base_cols = ["EventClass", "EventSubclass", "TextData", "IntegerData"]

# define events to trace and their corresponding columns

event_schema = {
"JobGraph": base_cols
}

warnings.filterwarnings("ignore")

You then need to start a trace using this definition, refresh the semantic model, stop the trace and filter the events captured so you only have those with the EventSubclass GraphFinished, remove the event which contains the metadata (which has a value of 0 in the IntegerData column) and then finally sort the rows in ascending order by the values in the IntegerData column:

WorkspaceName = "Insert workspace name here"
SemanticModelName = "Insert semantic model name here"

with fabric.create_trace_connection(SemanticModelName,WorkspaceName) as trace_connection:
# create trace on server with specified events
with trace_connection.create_trace(event_schema, "Simple Refresh Trace") as trace:

trace.start()

# run the refresh
request_status_id = fabric.refresh_dataset(SemanticModelName, WorkspaceName, refresh_type="full")
print("Progress:", end="")

while True:
status = fabric.get_refresh_execution_details(SemanticModelName, request_status_id, WorkspaceName).status
if status == "Completed":
break

print("░", end="")
time.sleep(2)

print(": refresh complete")
# allow ending events to collect
time.sleep(5)

# stop Trace and collect logs
final_trace_logs = trace.stop()



# only return GraphFinished events
final_trace_logs = final_trace_logs[final_trace_logs['Event Subclass'].isin(["GraphFinished"])]
# ignore metadata row
final_trace_logs = final_trace_logs[final_trace_logs['Integer Data'].ne(0)]
# sort in ascending order by Integer Data column
final_trace_logs = final_trace_logs.sort_values(by=['Integer Data'], ascending=True)

Finally, you need to take all the text from the EventText column of the remaining events and concatenate it to get the contents of the DGML file and then save that file to the Files section of the Lakehouse attached to your notebook:

# concatenate all text in TextData column
out = ''.join(final_trace_logs['Text Data'])
# change background colour of critical path nodes so it's easier to see in VS Code
out = out.replace("#263238", "#eba0a7")

# write dgml file
dgmlfile = open("/lakehouse/default/Files/" + request_status_id + ".dgml", 'x')
print (out, file=dgmlfile)
dgmlfile.close()

#dispose of trace connection
trace_connection.disconnect_and_dispose()

I found a nice Visual Studio Code extension called DGMLViewer which makes viewing DGML files easy. Rather than manually downloading the file, OneLake Explorer makes it easy to sync files in OneLake with your PC in a very similar way to OneDrive, which makes working with these DGML files in VS Code very straightforward because you can simply open the local copy when it syncs.

Here’s what one of thse DGML files, generated from the refresh of a very basic semantic model, looks like when viewed in DGML Viewer:

If you have Visual Studio you can also use it to view DGML files (you need to install the DGML Editor first); I found a VS extension called DgmlPowerTools 2022 which adds some advanced features. Here’s what a DGML file for a refresh looks like when visualised in Visual Studio 2022:

OK, so this looks cool but it also looks very complicated. What does it all mean? How can you interpret all this information and use it to optimise a refresh? That’s something for a future blog post!

[In my next post I look at how you can interpret this data and understand the concepts of blocking and waiting, and in the post after that show how you can extract the data in this DGML file to a table using a Dataflow]

Incremental Refresh On Delta Tables In Power BI

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However not everyone is ready for Fabric yet so there’s also a lot of interest in the new DeltaLake.Table M function which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or Databricks SQL Warehouse – in between your existing lake house and your import mode Power BI semantic models then this new function could allow you to remove it, to reduce complexity and cut costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible using this connector because there’s no database to query: a Delta table is just a folder with some files in. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table by the datetime column required for incremental refresh, that query is significantly faster than reading all the data from that table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.

There are three factors that influence the performance of Power Query when querying a Delta table:

  1. The internal structure of the Delta table, in particular whether it is partitioned or not
  2. The implementation of the connector, ie the DeltaLake.Table function
  3. The M code you write in the queries used to populate the tables in your semantic model

There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.

Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:

let
Source = AzureStorage.DataLake(
"https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/",
[HierarchicalNavigation = true]
),
ToDelta = DeltaLake.Table(Source),
#"Filtered Rows" = Table.SelectRows(
ToDelta,
each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
)
in
#"Filtered Rows"

Here’s the incremental refresh dialog:

I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:

The refresh took just over 30 minutes.

Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column added which removed everything apart from the date and I then partitioned the table by that new datetime column (called PickupDate here):

CREATE TABLE PartitionedByDateNYC

USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw

I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:

let
Source = AzureStorage.DataLake(
"https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/",
[HierarchicalNavigation = true]
),
ToDelta = DeltaLake.Table(Source),
#"Filtered Rows" = Table.SelectRows(
ToDelta,
each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
)
in
#"Filtered Rows"

…and refreshed again. This time the refresh took about 26 seconds.

Half an hour to 26 seconds is a big improvement and it’s because the DeltaLake.Table function is able to perform partition elimination: the partitions in the semantic model align to one or more partitions in the Delta table, so when each partition in the semantic model is refreshed Power Query only needs to read data from the partitions in the Delta table that contain the relevant data. This only happens because the filter in the Power Query query using the RangeStart and RangeEnd parameters is on the same column that is used to partition the Delta table.

In my final test I partitioned my Delta table by month, like so:

CREATE TABLE PartitionedNYC

USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw

The challenge here is that:

  1. The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
  2. Power BI incremental refresh creates partitions at the year, quarter, month and date granularities, so filtering by month can’t be used for date partitions

I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:

let
Source = AzureStorage.DataLake(
"https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
[HierarchicalNavigation = true]
),
ToDelta = DeltaLake.Table(Source),
YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart),
YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd),
FilterByPartition = Table.StopFolding(
Table.SelectRows(
ToDelta,
each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd
)
),
#"Filtered Rows" = Table.SelectRows(
FilterByPartition,
each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
)
in
#"Filtered Rows"

Interestingly this table refreshed even faster: it took only 18 seconds.

This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).

To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.

Getting Different Versions Of Data With Value.Versions In Power Query

Something I mentioned in my recent post about the new DeltaLake.Tables M function on the Fabric blog recently was the fact that you can get different versions of the data held in a Delta table using the Value.Versions M function. In fact, the Value.Versions is the way to access different versions of data in any source that has this concept – so long as Power Query has added support for doing so. The bad news is that, at least at the the time of writing, apart from the DeltaLake connector there’s only one other source where Value.Versions can be used in this way: the connector for Fabric Lakehouses.

Here’s how you can access the data in a table using the Lakehouse.Contents M function:

let
Source = Lakehouse.Contents(),
SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data]
in
SelectTable

As with DeltaLake.Table, you can get a table with all the different versions available using Value.Versions:

let
Source = Lakehouse.Contents(),
SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
ShowVersions = Value.Versions(SelectTable)
in
ShowVersions

Version 0 is the earliest version; the latest version of the data is the version with the highest number and this version can also be accessed from the row with the version number null. The nested values in the Data column are tables which give you the data for that particular version number. So, for example, if I wanted to get the data for version 2 I could click through on the nested value in the Data column in the row where the Version column contained the value 2. Here’s the M code for this:

let
Source = Lakehouse.Contents(),
SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
ShowVersions = Value.Versions(SelectTable),
Data = ShowVersions{2}[Data]
in
Data

The Lakehouse connector uses the TDS Endpoint of the Lakehouse to get data by default, as in the first code snippet above, but if you use Value.Versions to get specific versions then this isn’t (as yet) possible so it will use a slower method to get data and performance may suffer.

Last of all, you can get the version number of the data you’re looking at using the Value.VersionIdentity function. If you’re looking at the latest version of the data then Value.VersionIdentity will return null:

let
Source = Lakehouse.Contents(),
SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
GetVersion = Value.VersionIdentity(SelectTable)
in
GetVersion

If you are looking at version 2 of the data then Value.VersionIdentity will return 2:

let
Source = Lakehouse.Contents(),
SelectWorkspace = Source{[workspaceId = "insertworkspaceid"]}[Data],
SelectLakehouse = SelectWorkspace{[lakehouseId = "insertlakehouseid"]}[Data],
SelectTable = SelectLakehouse{[Id = "nested_table", ItemKind = "Table"]}[Data],
ShowVersions = Value.Versions(SelectTable),
GetVersion2 = ShowVersions{2}[Data],
GetVersionNumber = Value.VersionIdentity(GetVersion2)
in
GetVersionNumber

Anti Joins And Query Folding In Power Query In Excel And Power BI

Power Query allows you to merge (“join” in database terms) two tables together in a variety of different ways, including left and right anti joins. Unfortunately, as I found recently, anti joins don’t fold on SQL Server-related data sources, which can result in performance problems. Luckily there is a different way of doing anti joins that does fold.

Say you have two Power Query queries called Fruit1 and Fruit2 that get data from SQL Server tables containing the names of different varieties of fruit:

Now, let’s say you want to get a list of all the fruit varieties that are in Fruit1 and not in Fruit2. The obvious thing to do is to do a Merge and use the Left Anti option like so:

Here’s the M code, including an extra step to remove the join column that this creates:

let
    Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftAnti),
    #"Removed Other Columns" = Table.SelectColumns(Source,{"Fruit"})
in
    #"Removed Other Columns"

This gives you the correct result:

…but it doesn’t fold, so performance may be bad.

However, if you do a Merge and use the Left Outer option instead:

Then expand the join column (called Fruit2.Fruit here):

And then filter on that column so you only keep the rows where it contains the value null, and then remove that column, you get the same result:

Here’s the M:

let
    Source = Table.NestedJoin(Fruit1, {"Fruit"}, Fruit2, {"Fruit"}, "Fruit2", JoinKind.LeftOuter),
    #"Expanded Fruit2" = Table.ExpandTableColumn(Source, "Fruit2", {"Fruit"}, {"Fruit2.Fruit"}),
    #"Filtered Rows" = Table.SelectRows(#"Expanded Fruit2", each ([Fruit2.Fruit] = null)),
    #"Removed Other Columns" = Table.SelectColumns(#"Filtered Rows",{"Fruit"})
in
    #"Removed Other Columns"

This now does fold (meaning performance should be better) and gives you the following SQL:

select [_].[Fruit]
from 
(
    select [$Outer].[Fruit],
        [$Inner].[Fruit2]
    from [dbo].[Fruit1] as [$Outer]
    left outer join 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner] on ([$Outer].[Fruit] = [$Inner].[Fruit2] or [$Outer].[Fruit] is null and [$Inner].[Fruit2] is null)
) as [_]
where [_].[Fruit2] is null

IF, SWITCH And The Effect Of DAX Variables On Strict/Eager Evaluation

A lot of DAX performance problems relate to the use of strict and eager evaluation with the IF or SWITCH functions. It’s an incredibly complex, “it depends”, black-box type of topic and naturally Marco and Alberto have a great article on it here. Rather than describe any general rules – which would be pretty much impossible – in this blog post I want to show a specific scenario that illustrates how the use of DAX variables can influence whether strict or eager evaluation takes place.

Consider a Power BI semantic model that has the following three tables:

TableA and TableB are fact tables with numeric measure columns; Choice is a disconnected table containing text values that is intended for use in a slicer, so a report user can select an item in that slicer and that in turn influences what values a measure returns:

Here’s the definition of one such measure:

With Variables =
VAR a =
    SUM ( TableA[A] )
VAR b =
    SUM ( TableB[B] )
RETURN
    IF ( SELECTEDVALUE ( Choice[Choice] ) = "TableA", a, b )
)

And here’s my test report that includes a slicer and a card visual displaying the output of this measure:

Let’s look at the DAX query generated by the card visual containing the measure when “TableA” is selected in the slicer:

DEFINE
    VAR __DS0FilterTable =
        TREATAS ( { "TableA" }, 'Choice'[Choice] )

EVALUATE
SUMMARIZECOLUMNS (
    __DS0FilterTable,
    "With_Variables", IGNORE ( 'Choice'[With Variables] )
)

…and in particular what DAX Studio’s Server Timings feature shows us:

Even though only TableA is selected in the slicer and the query only returns the sum of the values in the A column of TableA, we can see that the Storage Engine is also querying TableB and getting the sum of the B column. It’s a great example of eager evaluation: both branches of the IF are being evaluated. Is this a bad thing? For this particular report it may be if the Storage Engine query for TableB is expensive.

How can you force strict evaluation to take place? You can force eager evaluation using the IF.EAGER function but there is no equivalent function to force strict evaluation. However you maybe be able to rewrite the measure to get strict evaluation to take place.

The key factor in this case is the use of variables in the measure definition. If you rewrite the measure to not use variables, like so:

No Variables =
IF (
    SELECTEDVALUE ( Choice[Choice] ) = "TableA",
    SUM ( TableA[A] ),
    SUM ( TableB[B] )
)

…then the query for the card visual query behaves differently:

DEFINE VAR __DS0FilterTable = 
TREATAS({"TableA"}, 'Choice'[Choice])

EVALUATE
SUMMARIZECOLUMNS(__DS0FilterTable, "No_Variables", IGNORE('Choice'[No Variables]))

Notice that there are now only two Storage Engine queries and that TableB is not now being queried, which will probably result in better performance. This is strict evaluation.

Why does the use of variables result in the use of eager evaluation here? Because it does, and it’s complicated. I need to stress that DAX uses lazy evaluation for variables which means that variables are not evaluated if they are not used – in the first measure above the IF is deliberately evaluating both branches. There are certainly other optimisations that may kick in and result in strict evaluation even when variables are used in IF/SWITCH. Indeed, the two measures in this example being from different fact tables is extremely important: if they had been from the same fact table then the behaviour would have been different and strict evaluation would have been used. In summary, though, if you are using variables and you want to try to force strict evaluation with IF/SWITCH then it’s worth rewriting your code to remove those variables to see if it makes a difference.

I also need to stress that these two measures will perform better or worse depending on how and where they are used. Consider these two table visuals that use the values from the Choice table on rows along with the two measures above:

Running the DAX query for the table on the left, which uses the measure with no variables and strict evaluation, gives the following in DAX Studio’s Server Timings:

Note that there are two Storage Engine queries for TableB, which is not a good thing.

On the other hand, the table on the right which uses the [No Variables] measure gives the following:

Note that there is only one Storage Engine query for TableB now, so in this case the tables have turned and the [With Variables] measure is likely to perform better.

When you use variables in the branches of IF/SWITCH, as in the [With Variables] measure, the variables are evaluated in a context at the place of the variable definition; if you don’t use variables in this way, as in the [No Variables] measure, the two SUM expressions used in the two branches are evaluated in the context of the branch, which adds a hidden filter context corresponding to the condition. This has two consequences which may hurt query performance:

  1. During evaluation, the hidden filter may be materialised into a large in-memory table
  2. Fusion cannot happen across common subexpressions in different branches because they have different contexts

In contrast the use of variables means the expressions used in them are evaluated in a context without the hidden filter corresponding to the branch, which can encourage fusion and discourage the materialisation of large in-memory tables.

[Thanks to Marius Dumitru and Jeffrey Wang for much of the information in this post]

Thoughts On Power BI Datasets Being Renamed To Semantic Models

Last week it was announced that Power BI datasets have been renamed: they are now semantic models. You can read the announcement blog post here and see the change in the Fabric/Power BI UI already.

The name change proved to be surprisingly uncontroversial. Of course it’s very disruptive – trust me, I know, I have around 500 blog posts that I need to do a search-and-replace on at some point – so I have a lot of sympathy for people with books or training courses that need updating or who are getting calls from confused end users who are wondering where their datasets have gone. But there was a general consensus that the change was the right thing to do:

When Marco approves of a change the whole Fabric team breathes a sigh of relief. The term “dataset” is too generic and too confusing for new developers; “semantic model” is a lot more specific and descriptive. Kurt Buhler has just written a very detailed post on what semantic models are. What else is there to say?

A name is often not just a name, it’s a statement of intent. While I don’t want you to read too much into the name change (Christian Wade does a good job of explaining how and why the name “semantic model” was chosen at the start of this Ignite session) and it’s always a mistake to think that we at Microsoft have some elaborate secret master plan for our products’ future development, people are nevertheless asking what the name “semantic model” signifies:

…and when someone as senior as Amir Netz asks me to do something, it’s probably a good idea for me to oblige 😉:

Power BI as a semantic layer is certainly one of my favourite topics: I wrote a very popular post on it last year. Even if it isn’t immediately apparent, Power BI is a semantic layer, a semantic layer made up of one or more semantic models. A lot of things (not just names) have changed in the world of Microsoft BI since I wrote that post which, in my opinion, only strengthen my arguments.

However you define the term “semantic layer”, reusability of data and business logic is a key feature. We all know that Bad Things happen to companies like the one discussed here on Reddit which create one semantic model per report: source systems are overloaded by the number of refreshes, the burden of maintenance becomes overwhelming and there are multiple versions of the truth. Creating the minimum number of semantic models necessary and using them as the source for your reports has always been a best practice in Power BI and the new name will, I hope, prompt developers to think about doing this more.

Would Power BI be better if it forced all developers to build their semantic layer upfront? No, I don’t think so. I believe a good BI tool gives you the flexibility to use it however you like so long as it can be used in the right way if you want – where “right” will mean different things for different organisations. If Power BI was more prescriptive and made you to do the “right” thing up front then I doubt the company discussed on Reddit in the link above would be more successful; instead it would add so many barriers to getting started they probably wouldn’t be using Power BI in the first place, they would be using Excel or some other tool in an equally inefficient way. What’s more if Power BI chose one “right” way of doing things it might exclude other “right” ways doing things, which would alienate the adherents of those other ways and be commercially damaging.

Fabric provides several new opportunities for reuse, with shortcuts and Direct Lake mode as the most obvious examples. Think about the number of Import mode semantic models you have in your organisation: each one will have a Date dimension table for sure, and there will certainly be a lot of dimension tables and probably a few fact tables duplicated across them. How much time and CPU is spent refreshing each of these tables? How many different versions of these tables are there, each one refreshed at different times? In Fabric you can maintain a single physical copy of your shared dimension tables and fact tables in Delta format in a Lakehouse, load data into them once, then reuse them in as many semantic models as you want via shortcuts. With Direct Lake mode no further refresh is needed, so each semantic model reuses the same copy of each dimension table and fact table and shows exactly the same data, saving time and compute and making them all consistent with each other. You can even now sync the tables in your Import mode semantic models to OneLake, making this pattern easier to adopt for existing Power BI users.

Another cause of data duplication in the past has been the different toolsets used by BI professionals and data scientists. Data is modelled and loaded for Power BI reports and business logic coded in DAX by the BI professionals, while in parallel data scientists have taken their own copies of the raw data, modelled it differently and implemented business logic in their own way in languages like Python. As Sandeep Pawar points out here, Semantic Link in Fabric now allows data scientists to query semantic models in SQL or in code, again promoting reuse and consistency.

Finally, looking ahead, I think the new Power BI Desktop Developer mode, Git integration and Tabular Model Definition Language (TMDL) will provide new ways of sharing and reusing business logic such as measure definitions between multiple semantic models. Not all the features necessary to do this are in Power BI/Fabric yet but when they do appear I’m sure we’ll see the community coming up with new patterns (perhaps successors to Michael Kovalsky’s Master Model technique) and external tools to support them.

In conclusion, as Power BI evolves into a part of something bigger with Fabric, then the new features I’ve mentioned here make it an even more mature semantic layer. Changing the name of datasets to semantic models is a way of highlighting this.

Excel Workbook Layout And The Performance Of Reading Data With Power Query In Power BI

Excel workbooks are one of the slowest data sources you can use with Power Query in Excel or Power BI. Reading small amounts of data from small workbooks is usually fast; reading large amounts of data from large workbooks can be very slow. But what about reading small amounts of data from large Excel workbooks? I did some tests and it turns out that performance can vary a lot depending on where your data is in the workbook and how that workbook is structured.

[Note: in this post I’ll be looking at .xlsx files, rather than other Excel formats like .xls and .xlsb; Excel files stored on a local disk and accessed via the File.Contents M function rather than stored in SharePoint or any other location; data read from Excel tables rather than direct from the worksheet; and Power Query in Power BI. Other scenarios may perform differently.]

Let’s see a simple example to illustrate what I found out. I created a new Excel workbook with one worksheet in and put a small table of data on it:

At this point the workbook’s size was 11KB. I then opened Power BI Desktop and created a Power Query query that read this table of data from the Excel workbook:

let
    Source = Excel.Workbook(File.Contents("C:\MyWorkbook.xlsx"), null, true),
    Sales_Table = Source{[Item="Sales",Kind="Table"]}[Data],
    #"Changed Type" = Table.TransformColumnTypes(Sales_Table,{{"Product", type text}, {"Sales", Int64.Type}})
in
    #"Changed Type"

Then I used this technique to measure how long it took to load the data from Excel into Power BI. Unsurprisingly, it was extremely fast: 63ms.

Then I added a new worksheet to the workbook, copied the same table onto it, added a large amount of random numbers underneath using the following Excel formula, and then copied and pasted the values returned by the formula over the output of the formula:

=RANDARRAY(9999,300)

Doing this meant the size of the workbook grew to 43MB. I then created a new Power Query query in Power BI Desktop, identical to the one above except that it connected to the new table. This time the query took 4918ms – almost 5 seconds.

Interestingly, even with the second worksheet with all the data on was added, the first query above (on the worksheet with no other data on) was still fast. I also tested refreshing a Power BI dataset that connected to two identical small tables on different worksheets in the same workbook, both with large amounts of other data on as in the second scenario above, and the performance of both queries was only slightly slower: it was clear two Power Query queries can read data from the same Excel workbook in parallel.

So: reading a small amount of data from a table on a worksheet with a large amount of other data on it is very slow.

What can we learn from this? Well, if you can influence the structure and layout of the Excel workbooks you are using as a data source – and that’s a big if, because in most cases you can’t – and you only need to read some of the data from them, you should put the tables of data you are using as a source on separate worksheets and not on the same worksheet as any other large ranges or tables of data.

It turns out that when the Power Query Excel connector reads data from an .xlsx file it can deserialise just some of the data in it rather than the whole thing, but what it can and can’t avoid deserialising depends a lot on the structure of the workbook and how the data is stored within the workbook .xlsx file. If you’re quick you can even see how much data is being read in Power BI Desktop in the refresh dialog:

You can also use Process Monitor, as I describe here, to see how much data is being read from any file used by Power Query.

Performance also depends on which application generated the .xlsx file (it’s not just Excel that creates .xlsx files, because other applications export data to .xlsx format without using Excel) or even which version of Excel saved the .xlsx file. This is because the same data can be stored in an .xlsx file in different ways, some of which may be more efficient to read than others. I found this blog post by Brendan Long on the .xlsx file format was very clear and it helped me understand how Power Query might go about reading data from an .xlsx file.

[Thanks to Curt Hagenlocher of the Power Query team for answering some of my questions relating to this post]