Power Query Memory Usage, Dataflow Container Size And Refresh Performance

When Power BI dataflows were officially released a few weeks ago there was a new setting for Premium capacities mentioned briefly in the announcement blog post: Container Size.

The blog post only says the following:

We’re introducing a new dataflow workload on premium capacity settings, enabling you to optimize dataflow workload performance for processing more complex, compute-heavy dataflows. This setting is available in the Capacity Admin portal, Dataflow workload settings.

…which does not tell you much at all. Pedro Fernandes contacted me to see if I knew more and, because I didn’t, I started investigating. As a result I learned a lot of new information about how the Power Query engine in Power BI Desktop and Excel uses memory, how things are different in the Power BI service, and how all of this can have an impact on query refresh performance.

If you’ve read this blog post, and this related thread on the Power Query forum, you’ll know that when a Power Query query is evaluated the work is done by a Microsoft.Mashup.Container process, visible in tools such as Task Manager and Resource Monitor. A single refresh operation in Excel or Power BI Desktop might result in multiple evaluations of multiple queries for different reasons, so it’s not uncommon to see multiple Microsoft.Mashup.Container processes.

image

In another thread from the Power Query forum, Curt Hagenlocher of the dev team explains that there is a 256MB limit on the amount of physical RAM that each Microsoft.Mashup.Container process can use, although there is no limit on the amount of virtual memory that can be used. The thread is about how using Table.Buffer can be bad for refresh performance, but the details are more widely applicable. Here are the highlights:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.

 

The 256MB limit is also mentioned briefly on this thread.
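
To make the trade-off concrete, here’s a minimal sketch of the kind of query where Table.Buffer can help because the table is enumerated more than once. The csv path and the exact transformations are made up purely for illustration; only the general pattern matters:

let
    //Hypothetical example: an expensive-to-read source, here a large csv file
    Source = Csv.Document(
        File.Contents("c:\data\large.csv"),
        [Delimiter = ",", Encoding = 65001]
        ),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    //Buffer the rows in memory so that the two enumerations below reuse them
    //instead of reading the csv twice. If the buffered table needs more than
    //256MB of RAM, though, the container starts paging and buffering can end
    //up slower than not buffering at all.
    Buffered = Table.Buffer(Promoted),
    Output = #table(
        {"Rows", "DistinctRows"},
        {{Table.RowCount(Buffered), Table.RowCount(Table.Distinct(Buffered))}}
        )
in
    Output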

Here’s a screenshot from Resource Monitor showing this: the Microsoft.Mashup.Container process in this case is evaluating a query that reads data from a large (roughly 900,000 rows) csv file and does a Pivot. While the Working Set value has peaked at around the 256MB limit, the Commit value is much higher, so paging must be taking place.

image

Currently there is no way to change this 256MB limit in Power BI Desktop or Excel, although someone has already posted a suggestion on the Ideas site to allow us to change it. How much of an impact does this actually have on refresh performance, though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.

The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use. This makes a lot of sense in the context of Power BI Pro and shared capacity because Microsoft could not allow one user to run lots of complex, expensive Power Query queries that might affect other users inside or outside the same tenant. With Power BI Premium, which gives you your own dedicated capacity in the Power BI service, there is no chance that anything you do will affect other tenants and so Microsoft gives you more control over how resources are used. As a result, the new Container Size setting for a dataflow in a Premium capacity lets you configure the amount of memory that can be used for a single entity refresh within a dataflow – and refreshing a single entity in a dataflow is, as far as I understand it, the equivalent of what a Microsoft.Mashup.Container process does on the desktop.

I did some (not very scientific) testing and it looks like increasing the Container Size setting can have a noticeable impact on the performance of memory-intensive queries. Using the technique I blogged about here I measured the execution time of the query mentioned above that does a Pivot on data from a large, local csv file in Power BI Desktop: it took 98 seconds on my laptop. I then used a dataflow to load the source data from the csv file into an entity in a dataflow on an A4 capacity, without making any changes. This took 132 seconds; you can get the time taken for a dataflow refresh by clicking on the Refresh History option, as Matthew Roche shows here. I then created a computed entity that used this new entity as its source and which did the same Pivot operation as the original query on the desktop. The following table shows the time taken to refresh this computed entity in the A4 capacity with different Container Size settings:

Container Size (MB)    Refresh Time (seconds)
700                    61
2000                   57
5000                   52

700MB is the default setting for Container Size and the smallest value that you can use; I guess the maximum value you can set will depend on the size of the capacity you’re using. There are two conclusions that I think you can draw from these results:

  • Even with the default setting for Container Size, it was faster for the computed entity to read the data from the source entity (which, remember, stores its data in Azure Data Lake Gen2 storage) and do the Pivot in the Power BI service than it was for Power Query in Power BI Desktop to read the data from the csv file and do the same Pivot operation on my laptop.
  • Increasing the Container Size setting reduced refresh time quite significantly.

So, as the blog post I referenced at the very beginning of this post states, if you are doing memory-intensive operations such as group bys, sorts, pivots, unpivots and joins against non-foldable data sources in a dataflow, and if that dataflow is on a Premium capacity, then increasing the Container Size property is probably a good idea because it may reduce refresh times. If you can reproduce this on your own Premium capacities please let me know by leaving a comment – I would be very interested to hear about your experiences.

[Thanks to Curt Hagenlocher and Anton Fritz for providing information for this blog post]

BI Survey 19

The BI Survey is the largest annual survey of BI users in the world, and every year I get a free copy of the results (which are always very interesting) in return for publicising it here on my blog. If you take part, you’ll also get a summary of the results and be entered into a draw for some Amazon vouchers. Here’s the link to take the survey:

https://www.efs-survey.com/uc/BARC_GmbH/f42f/?a=101

Power BI did very well last year and I’m sure it will do even better this year!

Table.Buffer() Does Not Buffer Nested Values

Here’s yet another entry in the list of useful things I learned from Ehren von Lehe on the Power Query MSDN forum: Table.Buffer() does not buffer nested table, record or list values inside the cells of a table. From this thread:

Buffering is shallow. It will force the evaluation of any scalar cell values, but will leave non-scalar values (records, lists, tables, etc.) as-is.

It’s actually really difficult to come up with a simple demo query to prove this, though (the Power Query engine is too clever about not evaluating things it doesn’t need for the final output of a query), but the principle is fairly easy to understand. Whenever you have an expression that returns a table like this:

image

…if you use Table.Buffer() on this table it will only buffer the scalar values (in this case the text values in the Name and Signature columns). The nested table values, as in the highlighted cell, will not be buffered in memory, and if you try to access their contents it may result in another call back to the underlying data source.
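
As a rough sketch of where this matters in practice (the server, database, table and column names below are all invented, so this is an illustration rather than a reproduction of the query in the screenshot), a nested join against a SQL Server source produces exactly this kind of column of nested table values:

let
    //Hypothetical example: server, database and table names are made up
    Source = Sql.Database("myserver", "mydb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    OrderLines = Source{[Schema = "dbo", Item = "OrderLines"]}[Data],
    //The nested join adds a column whose cells contain nested tables
    Joined = Table.NestedJoin(
        Orders, {"OrderID"},
        OrderLines, {"OrderID"},
        "Lines", JoinKind.LeftOuter
        ),
    //Buffering evaluates the scalar columns of Joined, but the nested tables
    //in the Lines column are not buffered; expanding them later may mean
    //another call back to the database
    Buffered = Table.Buffer(Joined)
in
    Buffered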

Comparing The Performance Of Reading Data From Files With File.Contents And Web.Contents In Power Query And Power BI

In my last post I mentioned the Power Query engine’s persistent cache, which in some scenarios caches the data read from a data source when a query is refreshed. Another important nugget of information that Ehren von Lehe of the Power Query dev team mentioned in a post on the Power Query MSDN forum recently is the fact that if you use File.Contents to get data from a file then the persistent cache is not used, but if you use Web.Contents to get data from the same file then the persistent cache is used. I guess the thinking here is that there is no point creating an on-disk cache containing the contents of a file that is already on disk.

Using Process Monitor (see here and here for more information on how to do this) to view how much data is read from disk when a query is run, it is possible to see this in action. Consider how much data is read from a 150MB csv file when a slow query is refreshed. This particular slow query results in five reads of the csv file – it’s more or less the same scenario from my second Process Monitor blog post here, with the query itself described in a lot of detail here. Here’s a graph of the data captured by Process Monitor, showing time in seconds on the X axis and the amount of data read in MB on the Y axis:

image

Since this query uses File.Contents to get the data from the csv file, the persistent cache is not used; as you can see, the graph shows clearly that the full contents of the csv file are read five times.

The same query altered to use Web.Contents shows just two full reads:

image

I have no idea why the file is read twice rather than once, but it’s definitely different behaviour to the version that uses File.Contents.

As far as I can see it is possible to replace File.Contents with Web.Contents in every case. So, if you have the following expression:

File.Contents("c:\users\myuser\Desktop\file.txt")

You can just replace it with:

Web.Contents("c:\users\myuser\Desktop\file.txt")

Which one is faster, though? Just because a query reads data from disk more often does not necessarily mean that it will be slower. In the above scenario, with the csv file stored on my local hard drive, the Web.Contents version of the query refreshes in 18 seconds while the File.Contents version refreshes in 14 seconds. Replacing the csv file with an Excel file that contains the same data (remember that Excel files are a lot slower than csv files to read data from, as I showed here) results in the File.Contents version of the query running in 205 seconds and the Web.Contents version running in 297 seconds. So it looks like, in most cases, File.Contents is the right choice when reading data from a file (as you would hope).

However, when using the same csv file stored on a network file share, the Web.Contents version takes 23 seconds while the File.Contents version takes 25 seconds. So maybe if you are dealing with files that are stored remotely over a slow connection it might be worth replacing File.Contents with Web.Contents to see if you get any performance benefits. There may be other situations where Web.Contents is the faster choice too. If you test this and see a difference, let me know by leaving a comment!

UPDATE: please also read Curt Hagenlocher’s comment below – Web.Contents may be changed in the future so that it only works with http/https URLs.

Power BI, Caching, Parallelism And Power Query Refresh Performance

Some time ago a customer of mine (thank you, Robert Lochner) showed me a very interesting scenario where a set of Power Query queries in Power BI Desktop refreshed a lot faster with the “Enable Parallel Loading Of Tables” option turned off. This seemed a bit strange, but I have recently been reading lots of posts by Ehren von Lehe of the Power Query dev team on caching and query evaluation on the Power Query MSDN forum, and two posts in particular, here and here, offered an explanation of what was going on. In this post I will walk through a very simple scenario that illustrates the information in those posts and sheds some light on the internals of the Power Query engine.

Let’s say you have a very simple web service built in Flow that, when it receives a GET request, waits 5 seconds and then returns the text “Hello”:

image

Here’s the M code for a Power Query query called SourceQuery that calls this web service and returns a table with one row and one column containing the text “Hello”:

let
    Source = Web.Contents("https://entermyurlhere.com"),
    ToLines = Lines.FromBinary(Source, null, null, 65001),
    ToTable = Table.FromColumns({ToLines})
in
    ToTable

image

Let’s also say you have four other identical queries called R1, R2, R3 and R4 that reference this query and do nothing but return the same table. Here’s the M code for these queries:

let
    Source = SourceQuery
in
    Source

Now, here’s the big question! If the output of SourceQuery is not loaded into Power BI but the output of R1, R2, R3 and R4 is loaded, as follows:

image

How many times would you expect the web service to be called when you refresh your dataset? One or four?

With the setting of “Enable Parallel Loading Of Tables” turned on, which is the default, the “Enable Data Preview To Download In The Background” setting turned off and Data Privacy checks turned off, Fiddler (see here for how to use it) shows the web service is called four times. SQL Server Profiler (see here for how to use it with Power BI) shows that the four queries are run in parallel and each query takes just over 5 seconds.

This seems wrong, because your natural inclination is to think that the Power Query engine evaluates SourceQuery first, and that after that has happened the data passes from there to R1, R2, R3 and R4. In fact what happens is the reverse: R1, R2, R3 and R4 are evaluated in parallel and, because each of them references SourceQuery, each of them causes SourceQuery to be evaluated independently. Yet another post by Ehren on the forum here explains this in detail.

The Power Query engine does do some caching in a ‘persistent cache’ that can be shared between queries and which, in some situations, stores the data requested from some types of data source on disk – it will store the data returned by Web.Contents, for example, but not File.Contents. The key statements about the persistent cache from this thread relevant to this particular problem are:

Power Query’s persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn’t block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.

and

…you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:

  • Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
  • Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)

In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential)

As a result, with the “Enable Parallel Loading Of Tables” setting turned off, so that the queries above are evaluated in series, Fiddler shows the web service is called only once. What’s more, Profiler shows that only one of the queries out of R1, R2, R3 and R4 takes just over five seconds while the other three, executed after it, are almost instant.

This must be because, when the queries are executed in series, the first query refreshes and data gets loaded into Power Query’s persistent cache. After that, even though the three subsequent queries also cause SourceQuery to be evaluated, each time SourceQuery is evaluated it can reuse the data stored in the persistent cache.

In contrast, with the four queries refreshed in parallel, each evaluation of SourceQuery takes place before the persistent cache has been populated and therefore the web service is called four times. This is slower, but in this case not much slower, than the scenario with the four queries executed in series – each query evaluation takes just over five seconds but remember that they are being executed in parallel so the overall duration is not much more than five seconds. In other cases the difference between parallel and sequential query execution could be a lot larger.

Now for the bad news: the “Enable Parallel Loading Of Tables” setting only works in Power BI Desktop, and there is no equivalent setting in the Power BI service. When a dataset is refreshed in the service, as far as I can tell the queries in the dataset are always evaluated in parallel. In Excel, as Ehren says, all queries are executed sequentially.

Is it possible to make sure the web service is called only once with all the queries executed in parallel? Yes, but not in a completely reliable way. The first thing to say here is that Table.Buffer and related buffering functions are not, as far as I understand it, useful here: they buffer data in memory within a single chain of execution, and in this case we have four separate chains of execution for R1, R2, R3 and R4. Instead what I have found that works is inserting a delay that gives the persistent cache time to be populated. If you keep the SourceQuery and R1 queries the same, and then alter the M code for R2, R3 and R4 so they wait for ten seconds (using Function.InvokeAfter) before returning as follows:

let
    Source = Function.InvokeAfter(
        ()=>SourceQuery, 
        #duration(0,0,0,10)
        )
in
    Source

 

…then the web service is only called once, because by the time R2, R3 and R4 evaluate SourceQuery the persistent cache has already been populated. In this particular case this approach results in the slowest overall refresh times, though, and it only works because I can be sure that the call to the web service takes five seconds.

There’s one last thing to say here: wouldn’t it be nice if we could ensure that the web service was only called once, and all subsequent queries got their data from a persisted copy of the data? As Matthew Roche points out here, this is exactly what dataflows allow you to do. Taking SourceQuery and turning it into an entity in a dataflow would result in a single call to the web service when the entity is refreshed, the data from the web service being persisted in the dataflow. You would then just need to have the queries R1, R2, R3 and R4 in your dataset, change them so they get their data from the entity in the dataflow, and as a result they would use the persisted copy of the data in the dataflow and would not call the web service. Another reason to use dataflows!

[I hope all the information in this post is correct  – it is based on Ehren’s forum posts and my own observations alone. If anyone from the Power Query dev team is reading this and spots an error, please let me know]

Power BI Sentinel: Backup, Documentation, Change Tracking And Lineage Tracking For Power BI

A few weeks ago at SQLBits I had a demo of a very interesting new tool for Power BI users called Power BI Sentinel. The website, with all the details, is here:

https://www.powerbisentinel.com

It helps solve several problems that everyone managing a Power BI deployment has to deal with. It can:

  • Back up reports and datasets (as .pbix files) direct from your App Workspaces to Azure Blob storage on a schedule, so you are able to access earlier versions and roll back if you need to.
  • Generate documentation on your datasets, including the DAX calculations used.
  • Identify what visuals and filters have changed in different versions of a report, and when those changes took place.
  • Track which data sources are used by which datasets and, in turn, which reports use those datasets.

image

It’s still a very new tool and is adding functionality all the time; there is also a lot of functionality that you’d want from a tool like this that Microsoft hasn’t built the APIs to support yet. It’s definitely worth checking out, though, and (in my opinion) it’s quite reasonably priced. I dare say Microsoft will build some of this functionality into Power BI at some point, but I don’t know exactly what it will deliver or when that will happen.

Full disclosure: I’ve known the owners of the company for a long time through their involvement with the UK SQL Server and Power BI community, and I was given a free licence for Power BI Sentinel by them.

Defining Relationships Between Entities In The Common Data Model To Automatically Create Relationships In Power BI

Following on from my last post on attaching manually-created Common Data Model folders in Power BI dataflows, I was asked whether defining relationships between entities in the model.json file of the CDM folder results in relationships being created between tables in a Power BI dataset. I’ve just tested it, and I can confirm it does.

Consider a CDM folder that contains two entities, Sales and Fruit. The Sales entity contains monthly sales data:

image

…and the Fruit entity contains a list of fruit sold:

image

Now – and this is important – let’s say you have defined the entities in the model.json so that the Sales entity has attributes Month, Product and Sales:

image

…and the Fruit entity has a single attribute called Fruit:

image

Because the Sales entity and the Fruit entity use different names for the attributes that contain the names of the fruit sold, when the two entities are loaded into Power BI no relationships are created between the two tables. There’s no way that Power BI can guess that there should be a relationship between the Product and Fruit columns based on these names.

However, if you add a relationship definition (as documented here) to the model.json file like so:

image
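
A relationship definition of this kind in model.json looks roughly like the following. This is a sketch based on the CDM folder metadata format, using the entity and attribute names from this example; check the documentation linked above for the exact schema:

"relationships": [
  {
    "$type": "SingleKeyRelationship",
    "fromAttribute": {
      "entityName": "Sales",
      "attributeName": "Product"
    },
    "toAttribute": {
      "entityName": "Fruit",
      "attributeName": "Fruit"
    }
  }
]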

Then, when you load the two entities as tables in the same Power BI dataset, you get a relationship created automatically:

image

As far as I can see it does this by adding a key definition to the Sales table, in a similar way to how the Table.AddKey M function or the Remove Duplicates transformation does, as I blogged here.
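
For reference, here’s a minimal sketch of what marking a key looks like in M with Table.AddKey – a generic illustration rather than the exact key definition Power BI generates in this case:

let
    Source = #table({"Fruit"}, {{"Apples"}, {"Oranges"}, {"Pears"}}),
    //Mark the Fruit column as a primary key on the table
    WithKey = Table.AddKey(Source, {"Fruit"}, true)
in
    WithKey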

Now, if we could add DAX calculations (calculated columns and especially measures) to the definition of the model.json file, so they were automatically created when entities were imported from a dataflow, that would be really cool. I don’t see why this would not be possible: you could store all kinds of useful information – such as the DAX for the calculations – in the metadata record of an M query loaded into a dataset (you can access this yourself using the Value.Metadata M function), and the Power BI engine could read it when the table is loaded and create the calculations from it.
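
For example, here’s a minimal sketch of how a metadata record can be attached to a table value in M and read back with Value.Metadata; the DaxMeasures field name is made up purely for illustration:

let
    Source = #table({"Month", "Sales"}, {{"Jan", 100}, {"Feb", 150}}),
    //Attach a metadata record to the table value using the meta operator;
    //the DaxMeasures field is a hypothetical place to store a measure definition
    WithMetadata = Source meta [DaxMeasures = "Total Sales = SUM(Sales[Sales])"],
    //Value.Metadata returns the metadata record attached to a value
    MetadataRecord = Value.Metadata(WithMetadata)
in
    MetadataRecord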
