Power BI And Support For Third Party Semantic Models

I’ve been working with Microsoft BI tools for 28 years now and for all that time Microsoft has been consistent in its belief that semantic models are a good thing. Fashions have changed and at different times the wider BI industry has agreed and disagreed with this belief; right now, semantic models are cool again because everyone has realised how important they are for AI. As a result, some of Microsoft’s partners and competitors (and sometimes it’s not clear which is which) have invested in building their own semantic models and/or metrics stores, some of which don’t work at all with Power BI, some of which only work with significant limitations, and a very small number which are fully supported and work with only minor limitations. This naturally raises the question of whether Power BI will ever work properly with any or all of them. The answer is no, and in this blog post I’ll explain why.

The first thing to make clear is that the reasons why some semantic models work well with Power BI and others don’t are purely technical. It is not because Microsoft has some grand plan to stifle competing BI tools. If you look at Fabric as a whole, you’ll see that Microsoft works closely with Databricks, Snowflake, DBT and many other companies to ensure that it integrates closely with them and gives customers the option to work with whichever other tools they want to use. In Power BI there are connectors to a wide range of data sources, not just Microsoft ones. Over the last year the Power BI team has spoken to all major vendors of third-party semantic models about integration with Power BI and it has been clear about what is and isn’t technically feasible. The door remains open for future collaboration and Microsoft respects the motives of these other vendors, in particular those who are developing open standards.

To understand the technical issues, let’s look at the architecture of a simple Power BI solution that uses an Import mode semantic model – as the vast majority of Power BI solutions do:

In this case the data from the data sources is copied into the Power BI semantic model, which also contains information on how the different tables of data should be joined to each other, measures (defined in the DAX language) describing how data should be aggregated and how more complex business calculations should be performed, which columns are visible and which ones are hidden, and a lot more. When the Power BI report is rendered it sends queries, again in the DAX language, to the semantic model to get the data it needs for each visual.

How could a third-party semantic model be used instead here? Power BI reports connect to Power BI semantic models using the XMLA protocol, and that means that Power BI reports can also connect to older Azure Analysis Services and SQL Server Analysis Services semantic models too. Some vendors have come up with a solution whereby they implement support for XMLA and tell their customers to connect to their semantic models using the SQL Server Analysis Services connector. This works up to a point but as you can imagine, using the SQL Server Analysis Services connector to connect to something that isn’t SQL Server Analysis Services is not supported and not wholly reliable.
It’s worth noting that using a third-party semantic model as a data source for an Import mode Power BI semantic model is not an option either because if Power BI imports metrics like percentage shares or time intelligence calculations it will not be able to aggregate data and get the correct result. Most metrics need to be calculated after the base data has been aggregated to work properly.

There are two other storage modes available for Power BI semantic models: Direct Lake and DirectQuery. Direct Lake only works with data stored in, or which can be reached via a shortcut from, Fabric OneLake so we don’t need to discuss it here. In DirectQuery mode the Power BI semantic model doesn’t store any data and instead, when it is queried, it generates SQL queries to get the data it needs from a data source on demand.

Other vendors of third-party semantic models have taken the approach of suggesting the use of Power BI in DirectQuery mode and having it run SQL against their semantic model. Apart from the fact that DirectQuery mode is usually slower and less cost-effective than Import mode or Direct Lake mode, your first reaction to this would probably be that putting one semantic model on top of another semantic model doesn’t make any architectural sense and you’d be right. There are several serious problems that emerge when you try to use Power BI in this way.

For example, a Power BI semantic model assumes that you have your data modelled as a star schema and that it will be able to generate SQL that joins dimension tables to fact tables. Not all third-party semantic models support something as basic as this yet. What’s more a Power BI semantic model assumes that it will be where all metrics will be calculated, which means that despite some interesting workarounds by third-party vendors (such as making the SQL SUM() function not actually sum up values) you can never be sure that you’ll get the correct values for a metric defined in a third-party semantic model, for example for subtotals or grand totals. There are a lot of other, similar problems that the Power BI team have made these third-party semantic model vendors aware of. These problems are not specific to Power BI semantic models either: no other semantic model would work well with another semantic model as its source.

If you can’t use Power BI semantic models on top of third-party semantic models, is it an option to synchronise calculations defined in a third-party semantic model to a Power BI semantic model? Yes, that is certainly possible and supported, and some of our partners (such as our friends at Tabular Editor) have already started down this path. DAX is a very rich language for defining metrics and Microsoft has invested a lot recently in making changes to Power BI semantic models programmatically as easy as possible. Without a doubt any metrics defined in a third-party semantic model can be reproduced in DAX, although since DAX is a much better fit for defining metrics than SQL you’ll probably find that some of the metrics you need can only be defined in DAX. In which case, rather than defining some of your metrics in a third-party semantic model and some in a Power BI semantic model, why not define all of them in your Power BI semantic model?

The final point to make is that Power BI semantic models can be used with a wide range of BI tools, not just Power BI reports. Apart from Microsoft tools like Excel and Fabric Paginated Reports, Tableau and several other non-Microsoft tools that you might think of as competitors to Power BI can also be used as a front-end for Power BI semantic models and this is supported. There is nothing stopping other BI tools from implementing connectivity to Power BI semantic models in the future. In Fabric you can even query a Power BI semantic model in SQL and extract data into a Pandas Dataframe in Python using the Semantic Link library. Anyone arguing that Power BI semantic models are somehow not “open” is wrong.

I’ll be honest, I think a lot of the reason why organisations that already use Power BI extensively consider third-party semantic models is because some people – not the Power BI users themselves, often people from a data engineering or database background – think of Power BI as just a visualisation tool and don’t realise that it also has the most mature, capable, widely used semantic model available in the market today. It is designed for both self-service and enterprise BI scenarios. Microsoft has no plans to make Power BI’s front end work properly with anything other than its own semantic models because that would be a huge amount of work with few benefits to customers: these third-party semantic models all behave differently and are at different levels of maturity, so any changes made in Power BI to accommodate them would risk breaking existing functionality or limit the use of advanced features. 35 million users view Power BI reports every month and those users query 20 million Power BI semantic models. Microsoft’s strategy is to continue to invest and strengthen Power BI semantic models for those customers. So if Power BI is how you want your end users to consume data, then Power BI semantic models, not any other third-party semantic model or metrics store, are the right place to store your metrics definitions and your business logic.

Role-Playing Dimensions In Fabric Direct Lake Semantic Models Revisited

Back in September 2024 I wrote a blog post on how to create multiple copies of the same dimension in a Direct Lake semantic model without creating copies of the underlying Delta table. Not long after that I started getting comments that people who tried following my instructions were getting errors, and while some bugs were fixed others remained. After asking around I have a workaround (thank you Kevin Moore) that will avoid all those errors, so while we’re waiting for the remaining fixes here are the details of the workaround.

Let’s say you have a Direct Lake on OneLake semantic model with two tables, a fact table called Conversation and a dimension table called Person. The Conversation fact table has one row for a conversation between two people, but at this point there is only one Person dimension in the model with a relationship from the FromPersonId column on Conversation to the PersonId column on Person:

How can you add a second copy of the Person dimension table without duplicating the data in OneLake?

In Power BI Desktop, while editing the semantic model, go to TMDL View and in the Data pane on the right hand side switch to the Model pane:

Expand Expressions and drag it into the TMDL pane to script it out. It should look something like this:

Then you need to make two changes:

On the line that starts “expression”, line 3 in the screenshot above, change the name of the expression to something new and unique
Delete the line that contains the lineage tag, line 8 in the screenshot above

Here’s what it should look like after:

Click Apply and this will create a duplicate Expression in the model. This is the trick that makes everything else work.

Next, create a new script in TMDL View and drag the Person dimension into it to script it out.

Then make the following changes to the script:

On the line that starts “table”, line 3 in the screenshot above, change the name of the table to something new and unique
One the line that starts “expressionSource”, line 31 in the screenshot above, change the name of the source expression to that of the new Expression created earlier
Delete all lines with lineage tags, ie those that start “lineageTag”
Add one line at the end with the text “changedProperty = Name”

Here’s what it should look like after:

Click Apply and this will create a copy of the dimension in the semantic model.

Then, back in Diagram View, you’ll see the new dimension table but with a warning saying that it hasn’t been refreshed. The next step is to refresh the model using the Schema and Data option:

At this point the new dimension table can be used like any other table, so you can create the relationship between the ToPersonId column on Conversation and the PersonId column on the new ToPerson dimension:

Generating Excel Reports With Fabric Dataflows Gen2

So many cool Fabric features get announced at Fabcon that it’s easy to miss some of them. The fact that you can now not only generate Excel files from Fabric Dataflows Gen2, but that you have so much control over the format that you can use this feature to build simple reports rather than plain old data dumps, is a great example: it was only mentioned halfway through this blog post on new stuff in Dataflows Gen2 Nonethless it was the Fabcon feature announcement that got me most excited. This is because it shows how Fabric Dataflows Gen2 have gone beyond being just a way to bring data into Fabric and are now a proper self-service ETL tool where you can extract data from a lot of different sources, transform it using Power Query, and load it to a variety of destinations both inside Fabric and outside it (such as CSV files, Snowflake and yes, Excel).

The documentation for the new Excel destination, which you can find here, is extremely detailed indeed so I thought it would be useful to show a simple example of how you can now use Dataflows Gen2 to build an Excel report. First of all I created a query using the Enter Data source that returned a table with some sales data in:

			
let
  Source = Table.FromRows(
    Json.Document(
      Binary.Decompress(
        Binary.FromText(
          "i45WciwoyEktVtJRMlWK1YlW8i9KzEsH8w0NwAIBqYlFYK4RmOtelFgAkbZUio0FAA==",
          BinaryEncoding.Base64
        ),
        Compression.Deflate
      )
    ),
    let
      _t = ((type nullable text) meta [Serialized.Text = true])
    in
      type table [Product = _t, Sales = _t]
  ),
  #"Changed column type" = Table.TransformColumnTypes(
    Source,
    {{"Product", type text}, {"Sales", Int64.Type}}
  )
in
  #"Changed column type"

		

I then created a query called ReportTitle that contained the text for my report’s title:

			
let
  Source = #table({"Title"},{{"My Sales Report"}})
in
  Source

…and a query called FruitSalesOverview that passes the data from the FruitSales query to the FabricAI.Prompt M function to generate a text summary of it:

			
let
  Source  = FabricAI.Prompt("Summarise the fruit sales data in 20 words or less", FruitSales),
  ToTable = #table({"Summary"}, {{Source}})
in
  ToTable

		

The last query I created, called Output, generated a navigation table in the format described in the docs to describe the Excel output: the report title in a range starting in cell B1, the report summary in a range starting in cell B3, the sales data in a table starting in cell B5 and a bar chart showing the sales data.

			
let
  excelDocument = #table(
    type table [
      Sheet      = nullable text,
      Name       = nullable text,
      PartType   = nullable text,
      Properties = nullable record,
      Data       = any
    ],
    {
      // Report title
      {"Sales", "Title", "Range", [StartCell = "B1", SkipHeader = true], ReportTitle},
      // Copilot-generated summary
      {"Sales", "Summary", "Range", [StartCell = "B3", SkipHeader = true], FruitSalesOverview},
      // Table containing sales data
      {
        "Sales",
        "SalesTable",
        "Table",
        [StartCell = "B5", TableStyle = "TableStyleMedium9"],
        FruitSales
      },
      //Column chart containing sales data
      {
        "Sales",
        "SalesChart",
        "Chart",
        [
          ChartType  = "Column",
          ChartTitle = "Fruit Sales",
          DataSeries = [AxisColumns = {"Product"}, ValueColumns = {"Sales"}]
        ],
        #table({}, {}) meta [Name = "SalesTable"]
      }
    }
  )
in
  excelDocument

		

I then set the Data Destination of the Output query to use the New File option to create an Excel file and save it to the Files section of a Fabric Lakehouse. The use of the Advanced format option meant that the navigation table returned by the Output query was used to determine the structure of the resulting Excel file.

After refreshing the Dataflow I downloaded the resulting Excel workbook from my Lakehouse. Here’s what it looked like:

Pretty fun. Does it give you full control over the format of the Excel file? No, not quite. Is it a somewhat code-heavy approach? Yes, but I suppose in the age of AI that doesn’t matter so much since you’re unlikely to write your own code (although, being old-school, I adapted the code above from the docs manually). Most importantly: is this a better way of dumping data to Excel and/or generating simple Excel reports in Fabric than paginated reports? Good question, especially since Power Query is now available in paginated reports. I suspect the answer is that although paginated reports are harder to build (though you can generate rdl with AI too, I’ve done it) and that it’s harder to control what a paginated report rendered as an Excel file looks like, paginated reports may still have the edge if you want an Excel report but I’m not sure – factors like CU cost and how long it takes to generate an Excel file using each approach would also need to be taken into account. If you just want to dump data to Excel, however, Dataflows Gen2 are probably a better option now.

Query Folding And Staging In Fabric Dataflows Gen2

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However one thing I never got round to doing until this week is looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder to do than you might think.

Consider the following simple Dataflow consisting of two queries:

The first query, called StageData, reads data from a 6GB, 17 million row CSV file containing open data from the English Prescribing Data dataset. It returns two columns called PRACTICE_NAME and TOTAL_QUANTITY from that CSV file:

Staging is enabled on this query:

The second query, called GroupBy, takes the data returned by the StageData query and does a Group By operation to get the sum of the values in the TOTAL_QUANTITY column for each distinct value in PRACTICE_NAME:

The output of this query was loaded to a Fabric Warehouse:

The scenario is basically the same as the one from my previous post but with a much larger data volume, and the idea was to test again whether it was faster to stage the data and do the Group By on the staged data or to not stage the data and do the Group By while reading the data direct from the source.

It turned out that, once again, staging made performance worse (don’t worry, I have other tests that show it can help performance). But the point about staging is that by loading the data from a query into a hidden Fabric Lakehouse, managed by the Dataflow (which is what is meant by “staging”), any subsequent operations on this data are faster because query folding can take place against the SQL Endpoint of this hidden Lakehouse – and at the time of writing this post there’s no way of knowing from the Dataflows Editor whether query folding is taking place. Right-clicking on the step that does the Group By operation shows that the “View data source query” option is greyed out but this only tells you that you the Editor doesn’t know if folding is taking place:

In fact other things in the UI, such as the query plan and the query folding indicators, suggest incorrectly that folding is not taking place:

So I thought: if query folding is taking place then the Group By will result in a SQL query run against the hidden Lakehouse, so maybe I can see this SQL query somewhere? Unfortunately since the Lakehouse is hidden you can’t get to it through the Fabric web interface. But then I remembered that you can connect to a Fabric workspace using good old SQL Server Management Studio (instructions on how to can be found here). And when I connected using SSMS I could see two hidden objects created by by Dataflow called StagingLakehouseForDataflows and StagingWarehouseForDataflows:

I was then able to run a SQL query using the queryinsights.exec_requests_history DMV against StagingWarehouseForDataflows, filtered for the time range when my Dataflow refresh was taking place:

			
SELECT start_time,statement_type, command, total_elapsed_time_ms
FROM 
    queryinsights.exec_requests_history 
WHERE 
start_time>'insert DF refresh start time here' 
AND end_time<'insert DF refresh end time here'
ORDER BY start_time desc

		

…and saw the following INSERT INTO statement that did the Group By operation I was expecting to see along with how long it took:

			
insert into 
[StagingWarehouseForDataflows_20260223144925].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_2930fd3c_002D2a62_002D4518_002Dafbf_002D249e7af54403] 
([Column1], [Column2])  
select [_].[Column1] as [Column1],      
convert(float, [_].[Total Quantity]) as [Column2]  
from   (      
select [rows].[Column1] as [Column1],          
sum([rows].[Column2]) as [Total Quantity]      
from [StagingLakehouseForDataflows_20260223144911].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_179186be_002D367d_002D4924_002Da8ba_002Dd1f220415e3a] 
as [rows]      
group by [Column1]  ) 
as [_]

		

So, a useful tip if you’re performance tuning a Dataflow even if it’s a bit of a pain to do. Hopefully in the future we’ll be able to see the SQL generated when query folding takes place against a staged table.

[Thanks to Miguel Escobar for his help with this]

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

For my tests I created two queries within a single Dataflow Gen2 CICD. First, an M function called SlowFunction that takes a numeric value and returns that value with 1 added to it after a two second delay:

			
(input as number) as number => 
Function.InvokeAfter(()=>input+1, #duration(0,0,0,2))

Then the main query which returns a table of ten rows and calls the SlowFunction M function once per row:

			
let
  Rows = List.Transform({1 .. 10}, each {_}), 
  MyTable = #table(type table [RowNumber = number], Rows), 
  #"Added custom" = Table.TransformColumnTypes(
    Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])), 
    {{"FunctionOutput", Int64.Type}}
  )
in
  #"Added custom"

		

Here’s what the output of the query looks like:

Now, the first important question. How long does this Dataflow take to run? 10 rows calling a function that takes 2 seconds to run, so 10*2=20 seconds maybe? The answer is yes if you look at how long the preview takes to populate in the Dataflow Editor:

That’s just the preview though. If you’re refreshing a Dataflow there are other things that happen that affect how long that refresh takes, such as Staging and loading the data to a destination. There’s no way you can split the performance of your M code from these operations when looking at the duration of a Dataflow refresh in Recent Runs, which explains why some of the timings you will see later in this post seem strange. Don’t worry, though, it doesn’t stop you from seeing the important trends. I’m told that setting a CSV file in a Lakehouse as your data destination is the best way of minimising the impact of loading data on overall refresh durations but at the time of writing the CSV destination can’t be used with Partitioned Compute so all my tests used a Fabric Warehouse as a destination.

Here’s what Recent Runs showed when this Dataflow was refreshed:

The overall refresh time was 59 seconds; the query (called NonPartitioned here) that returns the table of ten rows and which was staged took 29 seconds.

Could this Dataflow benefit from Partitioned Compute? With Partitioned Compute enabled in the Options dialog, I added the necessary M code to the query:

			
let
  Rows = List.Transform({1 .. 10}, each {_}), 
  MyTable = #table(type table [RowNumber = number], Rows), 
  #"Added custom" = Table.TransformColumnTypes(
    Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])), 
    {{"FunctionOutput", Int64.Type}}
  ), 
  ReplacePartitionKey = Table.ReplacePartitionKey(#"Added custom", {"RowNumber"})
in
  ReplacePartitionKey

		

…and then refreshed. Here’s what Recent Runs showed:

The overall refresh duration went up to 1 minute 25 seconds; the query that does all the work (called Partitioned in this case) took 40 seconds. Note the screenshot immediately above shows the engine used is “PartitionedCompute” and that there are now ten activities listed instead of one: my M code used the RowNumber column in the table as the partition key so the Dataflow attempted to run each row of the table as a separate operation in parallel. And as you can see, this made performance worse. This is because using Partitioned Compute introduces yet another overhead and that overhead is much greater than any benefit gained from parallelism in this case.

So I wondered: what if the delay in the query is increased from 2 second to 100 seconds then? Does this increase in the delay mean that parallelism results in faster overall performance?

Here’s what Recent Runs showed for a version of my Dataflow with a 100 second delay for each row and which didn’t use Partitioned Compute:

10 rows * 100 seconds = 1000 seconds = 16 minutes 40 seconds, so it’s not surprising that the overall duration of this version of the Dataflow was slow at 17 minutes 29 seconds.

Here’s what Recent Runs shows for the version of this Dataflow that did use Partitioned Compute:

The overall duration was 4 minutes 41 seconds and the main query took 3 minutes 14 seconds. The important takeaway is that this is a lot faster than the version that didn’t use Partitioned Compute, so clearly Partitioned Compute made a big difference to performance here. As you might expect, it looks like parallelising operations that only take a few seconds results in worse performance while parallelising operations that take longer, say a minute or more, is probably a good idea. As always, you’ll need to test to see what benefits you get for your Dataflows.

These results raise a lot of questions too. 100 seconds = 1 minute 40 seconds, which is a lot less than 3 minutes 14 seconds. Does this mean that not every row in the table was evaluated in parallel? Is partitioning on the RowNumber column counter-productive and would it be better to partition in some other way to try to reduce the amount of attempted parallelism? Is there something else that is limiting the amount of parallelism? While this version of the Dataflow always performs better than the non-partitioned version, performance did vary a lot between refreshes. While these tests show how useful Partitioned Compute can be for slow Dataflows, there’s a lot more research to do and a lot more blog posts to write.

A Closer Look At Preview-Only Steps In Fabric Dataflows

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.

The more I thought about this feature, though, the more I wondered about how it actually works and how it can be used in more complex queries that involve hand-written M code – so I decided to do a few tests. First of all, consider the following M query:

			
let
  Step1 = 1,
  Step2 = Step1 - 1
in
  Step2

		

This query returns a numeric value: 0.

At this point it will return the same value in the editor when viewing the preview of the output and when the Dataflow refreshes. But if you right-click on Step2 and select “Enable only in previews” then the query still returns 0 in the preview but will return 1 when the Dataflow refreshes. You will also see this message displayed in the preview pane:

			
This data preview uses preview-only steps. 
Results may differ when running the dataflow.

This makes sense because the query follows a linear pattern: the output references Step2 which in turn references Step1 so if you disable Step2 then the the output simply skips it and returns the value of Step1.

But what if your M code is more complex? For example consider this query:

			
let
  x = 5,
  y = 8,
  xtimesy = x*y,
  outputtable = #table(type table[xy=number],{{xtimesy}})
in
  outputtable

		

Here’s what this returns in the editor: a table that contains the value 40.

What if you disable the step called “y”?

Here’s what I saw in my Warehouse when I loaded the output of the query to it:

I had no idea what the output would be before I saw it but, thinking about it, I assume what has happened is that since the step “y” has been disabled, “y” simply returns the value of the previous step in the query, “x”, and therefore the output becomes 5*5=25. This understanding of how preview-only steps work is backed-up by the fact that it is not possible to set step “x” to be preview-only. The option to do so is greyed out:

This all leads on to a pattern I used in a Dataflow this week and which I can see myself using a lot more in the future. While being able to disable steps when the Dataflow refreshes is very useful, sometimes what you need is to write some conditional logic in your M code that does one thing if you’re in the editor and another thing if the Dataflow is refreshing. Here’s a slightly modified version of the first query above which I have called IsRefresh in my Dataflow:

			
let
  RefreshValue = 1,
  PreviewValue = RefreshValue-1
in
  PreviewValue

		

With the PreviewValue step set to be preview-only:

…then what we have here is a query that returns 1 when the Dataflow is refreshing and 0 when you’re viewing its output in the editor. Here’s an example of how it can be used in a query:

			
let
  Source = #table(
    type table [
      NumericValue = number,
      TextValue    = text
    ],
    {
      {
        IsRefresh,
        if IsRefresh = 1 then
          "This is a refresh"
        else
          "This is a preview"
      }
    }
  )
in
  Source

		

This query returns the following when you preview the output in the editor:

But when the Dataflow refreshes, the output in the destination is this:

Report On SAP And Salesforce Data In Fabric With Business Process Solutions

If you want to build a reporting solution on SAP (S/4HANA or ECC) or Salesforce data in Fabric and don’t want to build everything from scratch then you should check out Business Process Solutions. It’s a free, Microsoft-developed solution currently in public preview; the announcement blog post from last year is here and you can find all the docs here. It’s implemented as a Fabric custom workload which means that you can deploy it to a new workspace easily with just a few clicks, although there is of course a bit of configuration needed so it can connect to your data sources.

After you’ve done that it will generate all the Fabric items (pipelines, semantic models, Power BI reports etc) needed and you can concentrate on analysing your data. I know the team that is building Business Process Solutions and they are very smart so I’m sure what they’ve built is well designed.Check it out!

Power BI, Parallelism And Dependencies Between SQL Queries In DirectQuery Mode

This is going to sound strange, but one of the things I like about tuning Power BI DirectQuery semantic models is that their generally-slower performance and the fact you can see the SQL queries that are generated to get data makes it much easier to understand some of the innermost workings of the Power BI engine. For example this week I was trying to tune a DAX query on a DirectQuery model using DAX Studio and the Server Timings showed me something like this:

As I described here, Power BI can send SQL queries in parallel in DirectQuery mode and you can see from the Timeline column there is some parallelism happening here – the last two SQL queries generated by the DAX query run at the same time – but everything has to wait for that first SQL query to complete. Why? Can this be tuned?

Here’s the scenario that produced the query above. I have a DirectQuery semantic model built from the ContosoDW SQL Server sample database:

There are three base measures defined:

			
Distinct Customers = DISTINCTCOUNT(FactOnlineSales[CustomerKey])
January Customers = 
CALCULATE([Distinct Customers], 
KEEPFILTERS('DimDate'[CalendarMonthLabel]="January"))
Monday Customers = 
CALCULATE([Distinct Customers], 
KEEPFILTERS('DimDate'[CalendarDayOfWeekLabel]="Monday"))

		

Note that these measures are written specifically to prevent fusion from taking place: each measure generates a separate SQL query. Here’s what DAX Studio’s Server Timings shows for the DAX query generated for the table shown above:

As you can see, the three SQL queries generated by the DAX query are run in parallel.

Now consider the following measure:

			
IF Test = IF([Distinct Customers]>3000, [January Customers], [Monday Customers])

Here’s what this measure returns:

If you run the query generated for this visual in DAX Studio, Server Timings shows what I showed in the first screenshot in this post:

The last two substantial SQL queries, on lines 4 and 5, can only run when the first SQL query, on line 1, has finished. The details of SQL queries tell you more about what’s going on here. The first SQL query, on line 1, just gets the values for the [Distinct Customers] measure for all rows in the table:

The WHERE clauses for the SQL queries on line 4:

..and line 5:

…show that these last two queries only get the values for the [January Customers] and [Monday Customers] measures for the rows where the [IF Test] measure needs to display them. And this explains why the first SQL query has to finish before these last two SQL queries can be run: the WHERE clauses of these last two SQL queries are constructed using the results returned by the first SQL query.

There is another way of evaluating the IF condition in the [IF Test] measure. Instead of “strict” evaluation, where the engine only gets the value of [January Customers] for the rows in the table where [Distinct Customers] is greater than 3000 and only gets the value [Monday Customers] for the remaining rows, it can get values for [January Customers] and [Monday Customers] for all rows in the table and then throw away the values it doesn’t need. This is “eager” evaluation and as you would expect, Marco and Alberto have a great article explaining strict and eager evaluation here that is worth reading; Power BI can decide to use either strict or eager evaluation with the IF function depending on which one it thinks will be more efficient. However you can force the use of eager evaluation by using the IF.EAGER DAX function instead of IF:

			
IF EAGER Test = 
IF.EAGER([Distinct Customers]>3000, [January Customers], [Monday Customers])

Here’s what Server Timings shows for the DAX query that uses IF.EAGER:

As you can see, the use of IF.EAGER means that the three substantial SQL queries generated by Power BI for this DAX query can now be run in parallel because there are no dependencies between them: they get the values of [Distinct Customers], [January Customers] and [Monday Customers] for all rows in the table. However, even though these three SQL queries are now run in parallel, it doesn’t result in any performance benefits here because it looks like the three queries are slower as a result of all being run at the same time. Power BI has made the right call to use strict evaluation with the IF function in this case but if you see it using strict evaluation I think it’s worth experimenting with IF.EAGER to see if it performs better – especially in DirectQuery mode where Power BI knows less about the performance characteristics of the database you’re using as your data source.

[Thanks to Phil Seamark for helping me understand this behaviour]

Measuring Power BI Report Page Load Times

If you’re performance tuning a Power BI report the most important thing you need to measure – and the thing your users certainly care about most – is how long it takes for a report page to load. Yet this isn’t something that is available anywhere in Power BI Desktop or in the Service (though you can use browser dev tools to do this) and developers often concentrate on tuning just the individual DAX queries generated by the report instead. Usually that’s all you need to do but running multiple DAX queries concurrently can affect the performance of each one, and there are other factors (for example geocoding in map visuals or displaying images) that affect report performance so if you do not look at overall page render times then you might miss them. In this post I’ll show you how you can measure report page load times, and the times taken for other forms of report interaction, using Performance Analyzer in the Service and Power Query.

Consider the following series of interactions with a published Power BI report:

The report itself isn’t really that important – just know that there are a series of interactions with a slowish report while Performance Analyzer is running. Here’s what Performance Analyzer shows by the end of these interactions:

Here’s a list of the interactions captured:

I changed from a blank report page to a page with a table visual, where the table visual was cached and displayed immediately
I then refreshed the table visual on that page by clicking the Refresh Visuals button in the Performance Analyzer pane
I changed to the next page in the report and all the visuals on that page rendered
I changed the slicer on that new page
I clicked on the bar chart to cross-filter the rest of the page

As you can see from the screenshot above, Performance Analyzer tells you how long each visual takes to render within each interaction but it doesn’t tell you how long each interaction took in total. In a lot of cases you can assume that the time taken for an interaction is the same as the time taken for the slowest visual to render, but that may not always be true.

So how can you use Performance Analyzer to measure the time taken for these interactions? How can you measure the amount of time taken to render a page in a report?

To solve this problem I created a Power Query query that takes the event data JSON file that you can export from Performance Analyzer and returns a table showing the amount of time taken for each interaction. Here’s the M code for this query:

			
let
    Source = Json.Document(File.Contents("C:\PowerBIPerformanceData.json")),
    ToTable = Table.FromRecords({Source}),
    Events = ToTable{0}[events],
    EventTable = Table.FromList(Events, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(EventTable, "Column1", {"name", "start", "id", "metrics", "end"}, {"name", "start", "id", "metrics", "end"}),
    #"Expanded metrics" = Table.ExpandRecordColumn(#"Expanded Column1", "metrics", {"sourceLabel"}, {"sourceLabel"}),
    #"Added Custom1" = Table.AddColumn(#"Expanded metrics", "UserActionID", each if [name]="User Action" then [id] else null),
    #"Added Custom2" = Table.AddColumn(#"Added Custom1", "UserActionLabel", each if [name]="User Action" then [sourceLabel] else null),
    #"Changed Type" = Table.TransformColumnTypes(#"Added Custom2",{{"start", type datetime}, {"end", type datetime}, {"UserActionID", type text}, {"sourceLabel", type text}, {"UserActionLabel", type text}}),
    #"Filled Down" = Table.FillDown(#"Changed Type",{"UserActionID", "UserActionLabel"}),
    #"Filtered Rows" = Table.SelectRows(#"Filled Down", each [start] > #datetime(1970, 1, 2, 0, 0, 0)),
    #"Filtered Rows1" = Table.SelectRows(#"Filtered Rows", each [end] > #datetime(1970, 1, 2, 0, 0, 0)),
    #"Grouped Rows" = Table.Group(#"Filtered Rows1", {"UserActionID", "UserActionLabel"}, {{"Start", each List.Min([start]), type nullable datetime}, {"End", each List.Max([end]), type nullable datetime}}),
    #"Added Custom" = Table.AddColumn(#"Grouped Rows", "Duration", each [End]-[Start], type duration),
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"UserActionID"})
in
    #"Removed Columns"

		

Here’s the output of this query for the interactions shown above:

Some notes about this query:

You will need to change the Source step to point to the JSON file you have exported from Performance Analyzer
Each interaction is represented by a row in the table and identified by the UserActionLabel column
I’m calculating the durations by finding the minimum start time and the maximum end time for all events associated with an interaction and subtracting the former from the latter
There’s a bug (which hopefully gets fixed at some point) where some events have start and end dates in 1970, so I have filtered out any dates that are obviously wrong
The Duration column shows how long each interaction took and uses the Power Query duration data type, which is formatted as days.hours:minutes:seconds

The example above is fairly complex showing several different kinds of interactions. If you just want to find the amount of time taken to render all the visuals on a page you can click the Refresh Visuals button in Performance Analyzer to refresh all the visuals on the page – it may not give you a 100% “cold cache” page render but it will be good enough. I’m not a web developer but I think to really do things properly you’ll need to open the report on a blank page in the browser, do an “Empty Cache Hard Reload“, go to edit mode in the report, enable Performance Analyzer, then move to the page you want to test. If you’re testing a DirectQuery model then you’ll also want to include the overhead of opening connections (which can be substantial); the only way I have found to do that is either wait for at least an hour for any connections in the pool to be dropped, or if you’re using a gateway to restart it. One last point to make is that while you can use Performance Analyzer in Power BI Desktop and in the browser the behaviour of Power BI may be different in these two places, so always make sure you measure performance of published reports in the browser because that’s where your users will be using your reports.

Here’s what clicking the Refresh Visuals button in Performance Analyzer to refresh all the visuals on a page looks like:

This results in a single interaction and a single row in the output of the Power Query query above:

In this case you can see that the page refresh took 12.14 seconds.

As you will have realised by now, getting the amount of time it takes to load a report page isn’t straightforward and there are a lot of factors to take into account. Nonetheless using Performance Analyzer in this way is much better than not measuring page load times at all or (as I’ve seen some people do) using a stopwatch. If you try this and find something interesting please let me know: I’m doing a lot of testing with Performance Analyzer and learning new things all the time.

New Books: “The Definitive Guide To DAX” 3rd Edition And “Microsoft Power BI Visual Calculations”

For some reason I haven’t had any free copies of books to review recently; maybe the market for tech books has finally collapsed with AI? Books are still being published though and luckily, as someone who once published a book via an O’Reilly imprint, I have a lifetime subscription to O’Reilly online learning which gives me free access to all the tech books I ever need. Two books were published in the last few months that I was curious to read: the third edition of “The Definitive Guide To DAX” by my friends Marco Russo and Alberto Ferrari, and “Microsoft Power BI Visual Calculations” by my colleague Jeroen ter Heerdt, Madzy Stikkelorum and Marc Lelijveld. As I’ve said many times, I don’t write book reviews here (least of of reviews of books by friends or colleagues where I could never be unbiased), but I think there’s some value sharing my thoughts on these books.

“The Definitive Guide To DAX”, 3rd Edition

It’s generally accepted that the one book that anyone who is serious about Power BI should own is “The Definitive Guide To DAX”. If you don’t already own a copy you should buy one, but since most people who read my blog probably have one already the more interesting question to ask is what’s new in the third edition and whether it’s worth upgrading – especially since I’d seen Marco say that the book had been completely rewritten. I’ve heard the “completely rewritten” line before and I was sceptical but it turns out that it really is a very different book. It’s not completely rewritten because there is material there from previous editions but there are a lot of changes.

First of all, as you would expect, all the new additions to DAX since the second edition was published are covered including user defined functions, visual calculations, calendar-based time intelligence functionality and window functions. These are all really important features you will want to use in your semantic models and reports so this is the main reason you’d want to buy a copy of this edition.

Secondly, the main (and justfied) criticism of the previous editions was that they were, as we say in the UK, “heavy going”. They had absolutely all the information you would ever need but they were not the easiest books to read or understand. That has been addressed in the third edition: the tone is a little bit more friendly and difficult concepts are now explained visually as well as in text. As a result it’s easier to recommend the book for beginners.

Thirdly, some advanced topics (for example around performance tuning) have been dropped. For example I searched for the term “callback” in this new edition and found no mentions; that’s not true of the second edition. I have mixed feelings about this because it means the book isn’t as “definitive” as it used to be, but I can understand why it’s happened: with so much new content to add, keeping these advanced topics would have made an already long book too long. And let’s be honest, how often do you look at the details of a DAX query plan? If the aim is to teach DAX then cutting content means it’s easier for the reader to focus on the core concepts.

In summary, then, another great piece of work from Marco and Alberto and worth buying even if you have a copy of an earlier edition.

“Microsoft Power BI Visual Calculations”

A whole book about visual calculations? As I mentioned above, they’re covered in one chapter of “The Definitive Guide To DAX” but that book focuses on DAX; this one takes more time to explain the concepts and, crucially, includes a lot of practical examples of how to use them. Like user-defined functions, when visual calculations were released there was an explosion of community content showing how they can be used to solve problems that were difficult to solve in Power BI before – problems that no-one could have been anticipated that would be solved with visual calculations. The real value of this book is showing how to build a bump chart or a tornado chart with visual calculations and that makes it worth checking out.

Closing thoughts: why buy a book?

As you would expect, a lot of the information contained in these books is already available for free somewhere on the internet. And with AI you don’t even need to know how to search for it or stitch it all together – you can ask a question and get an answer customised to your exact scenario. So why buy books any more? I guess it depends on whether you only want to get your problems solved or understand how to solve problems yourself. For me (even though my attention span has eroded in recent years, just like everyone else’s) the only way to grasp really difficult concepts is through long-form written explanations or training courses, not fragments found in blog posts or 10-minute videos. I suspect that AI is the final nail in the coffin of the tech publishing industry but the tech book industry not being viable any more is not the same thing as tech books not being useful any more. Or maybe I’m just old-fashioned.

Chris Webb's BI Blog

Power BI And Support For Third Party Semantic Models

Like this:

Role-Playing Dimensions In Fabric Direct Lake Semantic Models Revisited

Like this:

Generating Excel Reports With Fabric Dataflows Gen2

Like this:

Query Folding And Staging In Fabric Dataflows Gen2

Like this:

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Like this:

A Closer Look At Preview-Only Steps In Fabric Dataflows

Like this:

Report On SAP And Salesforce Data In Fabric With Business Process Solutions

Like this:

Power BI, Parallelism And Dependencies Between SQL Queries In DirectQuery Mode

Like this:

Measuring Power BI Report Page Load Times

Like this:

New Books: “The Definitive Guide To DAX” 3rd Edition And “Microsoft Power BI Visual Calculations”

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: