Power BI DirectQuery Mode: A Better Choice Than You Might Think

For as long as I’ve been using Power BI – which has been from the beginning – the advice about which storage mode to choose has been the same: use Import mode unless you have a really, really good reason to use DirectQuery mode and even then you’re probably wrong and should use Import mode. Import mode was always a lot faster and a lot easier to tune. Marco’s advice in this LinkedIn post from last year pretty much summed up my attitude and that of every other Power BI expert out there:

DirectQuery was only for situations where you either:

  • Had more data than you could fit in memory, which was very, very rare, or
  • Where you needed to see truly real-time data in your reports, which business stakeholders always claim they want but they rarely ever need

The addition of Direct Lake as a third storage mode in Fabric didn’t change the situation. While there are some really good reasons to use Direct Lake instead of Import mode (which I should really cover in a separate blog post because I don’t think they are explained properly anywhere), it didn’t change the advice around DirectQuery and I still didn’t think DirectQuery was a good option for most people.

I have now changed my mind. Import or Direct Lake should still be the default for most projects but DirectQuery is now the best choice when you’re working with larger data volumes that could still be handled by Import or Direct Lake. This is a controversial statement, I know, and it’s based on experiences with customers that I can’t talk about directly, and as always there are some important details, so let me explain myself.

My new advice is this. If:

  • You are starting a new project that is 100% on Fabric and all of your data will be loaded into a Fabric Lakehouse or Warehouse, and
  • You have the following:
    • Fact tables with more than a couple of billion rows in them, and
    • Dimension tables with more than a couple of million rows in them, and/or
    • Distinct count measures on columns with more than a couple of million distinct values in

Then you should start with a composite model design that uses:

  • DirectQuery mode fact tables on the Warehouse or SQL Endpoint of your Lakehouse
  • Dual mode dimension tables (which means they can swap between Import mode and DirectQuery mode when necessary)
  • Aggregation tables based on your fact tables that are either in Import mode or DirectQuery mode

Why am I recommending this? Well let me take Marco’s objections to DirectQuery mode one by one:

Is Import mode or Direct Lake mode always faster than DirectQuery? For the composite model scenario described above, no, not always. Even if your Import mode or Direct Lake mode model fits within the memory limits of the Fabric capacity that you’re using, then a composite model and DirectQuery fact tables can be almost as fast and may be faster. This is because:

  • If your query hits an aggregation table, which is potentially a lot smaller than the fact table, then it will be faster than if it hits an Import mode fact table. Of course if your aggregation tables are in Import mode then naturally your query will be faster than if it hits an Import mode fact table, so you could argue that this is an Import mode scenario anyway. But the native aggregation functionality is not available in Import mode or Direct Lake mode at the time of writing, it’s only available if the underlying table is in DirectQuery mode. And yes I know you can simulate aggregations in Import mode with some clever DAX but that clever DAX can also carry an overhead.
  • The Fabric Warehouse engine, which not only powers Fabric Warehouse but also the SQL Endpoint of a Lakehouse, has improved a lot in the last year and is now incredibly fast. When the new GPU acceleration and other upcoming performance features land then it will be even faster. For example, it’s already faster at doing distinct counts than the Vertipaq engine used by Import or Direct Lake models.
  • The Warehouse engine also has some architectural advantages over Vertipaq when it comes to concurrency. Now Vertipaq is already really, really good at concurrency and features like semantic model scale-out make it even better, but for the large data volumes mentioned above and when you have more than 5-10 concurrent users then Warehouse is better. It’s actually quite rare to have more than 5-10 genuinely concurrent users (ie users that are running queries at exactly the same time) in a BI solution but it does happen, for example at month-ends.

Is DirectQuery for the rich? Is DirectQuery always more expensive than Import mode or Direct Lake mode? Actually no, not if you’re using a composite model in Fabric for the scenarios we’re talking about.

Let’s say you’re building a large Import mode or Direct Lake mode model in Fabric and you hit the limits of the capacity SKU that you’re using. It could be that you’re hitting the memory limit for your SKU or one of the other Direct Lake guardrails. Or it could be that during your load testing (and you should always do load testing) you hit the CU limits of your capacity. What do you do? Well you can and should do some tuning to see if you can avoid hitting those limits. Or you can scale up to the next capacity size, although that might be expensive. The third option is to use a composite model with DirectQuery fact tables in the way I’ve described.

The first advantage of a composite model with DirectQuery fact tables here is that because only your dimension tables and any Import mode aggregation tables you have are subject to the memory limits for Power BI models that each capacity SKU enforces, you’re much less likely to hit those limits.

The second advantage of a composite model with DirectQuery is that is that if you hit your fact table then Power BI will generate SQL queries against the Warehouse engine and the SQL queries generated are counted as background operations which are then smoothed over 24 hours. DAX queries on Import mode or Direct Lake mode models are counted as interactive operations and are smoothed over 5 to 64 minutes. Queries against a DirectQuery table will still consume some interactive CUs and of course if you hit a Dual mode dimension table or an Import mode aggregation that will also consume interactive CUs, but the Storage Engine is where most of the CUs get burned in an Import or Direct Lake mode model. This is the main reason why a composite model approach can handle more concurrent users: you’re less likely to run into the CU limits for the capacity SKU you’re using because of the difference in how smoothing works.

What’s more, these advantages mean that even if you don’t hit the limits for the capacity SKU you’re using, you could still save money by using a composite model because it could allow you to use a smaller capacity.

You could argue that both of these advantages are purely accounting tricks, results of the rules that we at Microsoft have imposed on how Fabric capacities work, and you’d have a point. But it’s unlikely these rules will change anytime soon.

Is DirectQuery still difficult to manage? Yes, this objection still stands in my opinion. If you need to tune an Import model you need an expert in tuning Import models. They’re rare but they exist – you can call Marco for example or take one of his courses. If you need to tune a composite model like the one I’ve described you need someone who can tune an Import model, someone who can tune a DirectQuery model (which is really rare) and someone who can tune your Fabric Warehouse or SQL Endpoint. You might not have that combination of skills in your team, and if you do then it would make maintenance and development more expensive – which undermines the “DirectQuery is cheaper” argument a bit.

Finally, there are a couple of other important questions that need to be addressed.

I have Databricks, Snowflake or some other database and I’d like to use that instead of Fabric Warehouse or the Lakehouse SQL Endpoint. Is that a good idea? No, and I’m not just saying that because I work on the Fabric team and I want you to believe that Warehouse/SQL Endpoint is faster or better than them. It’s because the underlying architecture of Fabric and the connector that Power BI uses to connect to Fabric Warehouse/SQL Endpoint means that DirectQuery on Warehouse/SQL Endpoint is significantly faster and more scalable, even apart from the performance of the SQL queries themselves. More optimisations are planned to make the “better together” story even more compelling. And no, before anyone suggests it, we’re not deliberately trying to hobble the performance of other, non-Fabric databases – many other Microsoft data sources share the same architectural disadvantages when it comes to DirectQuery as Snowflake and Databricks. And while it is possible to make DirectQuery perform well for any data source, my point is that DirectQuery on Fabric Warehouse is a special case.

Also, much anecdotal evidence suggests that using Power BI in DirectQuery mode on non-Fabric sources can be more expensive than Import mode because you need to pay to use those other sources as well as pay for your Fabric capacity. DirectQuery mode on non-Fabric databases is more expensive in CU terms than DirectQuery on Fabric Warehouse/SQL Endpoint because of those architectural differences I mentioned and I’ve seen it even be more expensive than an Import model in terms of CU consumption. Meanwhile, as I’ve said, for larger volumes DirectQuery on Warehouse/SQL Endpoint can be cheaper than Import mode or Direct Lake.

Are there any other advantages to DirectQuery mode? Yes but maybe not ones you care about. Do you want genuinely dynamic calculated columns? It’s been possible for years with DirectQuery (see here – the example uses KQL but similar things are possible in SQL). One day I’ll get my demo showing how to do proper time zone conversion, handling daylight savings time, in DirectQuery. My colleague Mark Pryce-Maher has a nice demo of calling Fabric AI functions in a DirectQuery model. But these are all niche use cases.

Your size recommendations for when to use DirectQuery above are very vague. Can’t you be more precise? No, because so much depends on the design of your semantic model and reports and the nature of your data. In an ideal world you would still always choose Import mode or Direct Lake mode as your default and only shift to a DirectQuery composite model when necessary but doing that halfway through a project is quite disruptive. If you want to test the performance of Import mode, Direct Lake mode and DirectQuery mode for your project then go ahead, but if you don’t then for the volumes I’m talking about there’s a very good chance DirectQuery is the best option.

Last of all, I need to stress that this recommendation could change in the future. While Fabric Warehouse is getting a lot of improvements and optimisations, so is Direct Lake mode, so the price and performance characteristics of both will change a lot and that means the decision about which one is the best choice for larger data volumes may change too. I promise to update this post if and when that happens.

Improve Performance Of Staging In Fabric Dataflows Gen2 By Disabling V-Order

Quite a few new Dataflows Gen2 features were released recently without much fanfare, but that doesn’t mean they aren’t important. I will take a look at them all in my next few posts; in this first post I’ll look at the ability to disable V-Order on staged data.

As the (very detailed) documentation for this new feature describes, V-Order is a write-time optimisation for the parquet files that underpin the Delta tables that OneLake uses to store data. It slows down writing data to the tables but means that reading data from them, for example in Power BI Direct Lake mode, is much faster. It used to be the case that when you staged data inside a dataflow that data always had V-Order applied; now you have the option to disable V-Order. Disabling V-Order makes staging faster and because staged data is rarely queried more than a few times, disabling V-Order usually improves overall refresh performance.

To test this I created a simple dataflow that connected to a large (5.58GB) CSV file that contained 17.6 million rows of data, staged the data in a query called StageData, then did a group by on that data in a second query called GroupBy.

I turned off Fast Copy and left the “Enable V-Order compression” setting on:

[At the time of writing this post the ability to disable V-Order only works when Fast Copy is not used – I expect this to change in the future]

I refreshed the dataflow and it took 1 minute 59 seconds. The StageData query (where the staging takes place) took 1 minute 31 seconds; the GroupBy query took 12 seconds.

I then disabled V-Order compression for staging:

…and refreshed again. This time overall refresh took 1 minute 32 seconds, the StageData query took 1 minute 13 seconds and the GroupBy query took 7 seconds. While there is always a certain amount of variation in dataflow refresh timings it’s clear that disabling V-Order resulting in staging being about 20 seconds faster with no reduction in performance of the group by transformation on the staged data. So, in this case at least, disabling V-Order was a good thing for refresh performance.

When you decide whether to use staging in a dataflow you have to test to see whether the extra time needed to stage the data is worth it compared to the performance improvements you get by doing transformations on the staged data (which mostly come from those transformations having the opportunity to be folded). Since turning off V-Order makes staging faster it means that staging is more useful and will result in better overall dataflow refresh performance more often.

Fabric Dataflows Gen2 And Concurrent Evaluation, Part 3: Is More Parallelism Always Good?

To finish off my series of posts on concurrent evaluation in Fabric Dataflows Gen2 (see part 1 and part 2) I decided to do some more realistic tests to see how much parallelism I could get. To do this I uploaded 244 identical Excel files containing almost 542000 rows of data each to a SharePoint document library. Excel files are probably the worst-performing file format for dataflows (see here for some tests that show this), while SharePoint is probably the worst-performing place to store data for a dataflow and also has a reputation for throttling applications that make too many requests.

As a baseline I created a Dataflow Gen2 that combined all the data from all the Excel files in the document library and loaded it into a single table in a Fabric Warehouse; I configured this dataflow up so it did not use Partitioned Compute.

Looking at the Recent Runs data for this dataflow I could see that the query that combined the data from all the Excel files took on average 28 minutes 8 seconds to run.

I then created a copy of this dataflow and modified it so that it did use Partitioned Compute. I set the column containing the Excel filename as the partition key which meant that, in theory, the dataflow could read the data from each of the files in parallel. I then refreshed the dataflow with different maximum concurrency settings (see the previous post in this series for how to do this). Here are the results:

Maximum ConcurrencyAverage Refresh Duration (Minutes:Seconds)
418:37
89:42
165:24
2505:21
No limit set5:16

A few interesting things to note:

  • Refreshes with a maximum concurrency of 4 were significantly faster than the baseline although nowhere near four times faster, possibly because of the overhead introduced by Partitioned Compute, but
  • Refreshes with a maximum concurrency of 8 were around twice as fast as refreshes with a maximum concurrency of 4, and refreshes with a maximum concurrency of 16 were almost four times as fast
  • There was no performance gain achieved by setting a maximum concurrency of more than 16
  • Setting a limit of more than 16 resulted in the same performance as not setting a limit at all

One last thing to mention is that one of the refreshes when no limit was set failed after two minutes. It was difficult to tell why it failed but after downloading the logs and looking through them it seems that SharePoint was returning a 429 Too Many Requests error, which makes sense given that the dataflow was trying to read data from a lot of Excel files at the same time. Given that no data source, especially SharePoint, can be expected to handle an infinite number of parallel requests for data from a dataflow then this is a good example of where too much parallelism can be a bad thing and cause errors, and where it’s important to limit the amount of parallelism inside the dataflow.

Generating Sample Data In Fabric Dataflows With FabricAI.Prompt()

Back in December the FabricAI.Prompt() M function was released in Fabric Dataflows Gen2. Most of the people writing about it at that time, as in this great post by my colleague Sandeep Pawar, focused on calling this function for each row in a table – something that the UI in the editor makes easy. However the FabricAI.Prompt() function itself is a lot more flexible. You can use it to summarise whole tables of data as I showed here; you can also use it to generate sample data. This is similar to what I blogged about here where I got Copilot to generate M code that returned sample data but using FabricAI.Prompt() is maybe a bit simpler.

The trick is to get FabricAI.Prompt() to generate a table of data in CSV format. Here’s an example prompt:

Generate a table of sample sales data in CSV format.
The table should have three columns called Country, Product and Sales.
The Country column should contain the names of random European countries.
The Product column should contain the names of random types of fruit.
The Sales column should contain random numbers between 1 and 100.
The table should contain 10 rows.
The first row of text returned should contain the column names.
Subsequent rows should contain the data.

…and here’s how this prompt can be used with Fabric.AIPrompt() and how the text that the function returns can be turned into a table using the CSV.Document function:

let
Source = FabricAI.Prompt(
"Generate a table of sample sales data in CSV format. The table should have three columns called Country, Product and Sales. The Country column should contain the names of random European countries. The Product column should contain the names of random types of fruit. The Sales column should contain random numbers between 1 and 100. The table should contain 10 rows. The first row of text returned should contain the column names. Subsequent rows should contain the data."
),
ToCSV = Csv.Document(Source),
#"Promoted headers" = Table.PromoteHeaders(
ToCSV,
[PromoteAllScalars = true]
),
#"Changed column type"
= Table.TransformColumnTypes(
#"Promoted headers",
{
{"Country", type text},
{"Product", type text},
{"Sales", Int64.Type}
}
)
in
#"Changed column type"

And here’s an example of the result:

Role-Playing Dimensions In Fabric Direct Lake Semantic Models Revisited

Back in September 2024 I wrote a blog post on how to create multiple copies of the same dimension in a Direct Lake semantic model without creating copies of the underlying Delta table. Not long after that I started getting comments that people who tried following my instructions were getting errors, and while some bugs were fixed others remained. After asking around I have a workaround (thank you Kevin Moore) that will avoid all those errors, so while we’re waiting for the remaining fixes here are the details of the workaround.

Let’s say you have a Direct Lake on OneLake semantic model with two tables, a fact table called Conversation and a dimension table called Person. The Conversation fact table has one row for a conversation between two people, but at this point there is only one Person dimension in the model with a relationship from the FromPersonId column on Conversation to the PersonId column on Person:

How can you add a second copy of the Person dimension table without duplicating the data in OneLake?

In Power BI Desktop, while editing the semantic model, go to TMDL View and in the Data pane on the right hand side switch to the Model pane:

Expand Expressions and drag it into the TMDL pane to script it out. It should look something like this:

Then you need to make two changes:

  • On the line that starts “expression”, line 3 in the screenshot above, change the name of the expression to something new and unique
  • Delete the line that contains the lineage tag, line 8 in the screenshot above

Here’s what it should look like after:

Click Apply and this will create a duplicate Expression in the model. This is the trick that makes everything else work.

Next, create a new script in TMDL View and drag the Person dimension into it to script it out.

Then make the following changes to the script:

  • On the line that starts “table”, line 3 in the screenshot above, change the name of the table to something new and unique
  • One the line that starts “expressionSource”, line 31 in the screenshot above, change the name of the source expression to that of the new Expression created earlier
  • Delete all lines with lineage tags, ie those that start “lineageTag”
  • Add one line at the end with the text “changedProperty = Name”

Here’s what it should look like after:

Click Apply and this will create a copy of the dimension in the semantic model.

Then, back in Diagram View, you’ll see the new dimension table but with a warning saying that it hasn’t been refreshed. The next step is to refresh the model using the Schema and Data option:

At this point the new dimension table can be used like any other table, so you can create the relationship between the ToPersonId column on Conversation and the PersonId column on the new ToPerson dimension:

Generating Excel Reports With Fabric Dataflows Gen2

So many cool Fabric features get announced at Fabcon that it’s easy to miss some of them. The fact that you can now not only generate Excel files from Fabric Dataflows Gen2, but that you have so much control over the format that you can use this feature to build simple reports rather than plain old data dumps, is a great example: it was only mentioned halfway through this blog post on new stuff in Dataflows Gen2 Nonethless it was the Fabcon feature announcement that got me most excited. This is because it shows how Fabric Dataflows Gen2 have gone beyond being just a way to bring data into Fabric and are now a proper self-service ETL tool where you can extract data from a lot of different sources, transform it using Power Query, and load it to a variety of destinations both inside Fabric and outside it (such as CSV files, Snowflake and yes, Excel).

The documentation for the new Excel destination, which you can find here, is extremely detailed indeed so I thought it would be useful to show a simple example of how you can now use Dataflows Gen2 to build an Excel report. First of all I created a query using the Enter Data source that returned a table with some sales data in:

let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText(
"i45WciwoyEktVtJRMlWK1YlW8i9KzEsH8w0NwAIBqYlFYK4RmOtelFgAkbZUio0FAA==",
BinaryEncoding.Base64
),
Compression.Deflate
)
),
let
_t = ((type nullable text) meta [Serialized.Text = true])
in
type table [Product = _t, Sales = _t]
),
#"Changed column type" = Table.TransformColumnTypes(
Source,
{{"Product", type text}, {"Sales", Int64.Type}}
)
in
#"Changed column type"

I then created a query called ReportTitle that contained the text for my report’s title:

let
Source = #table({"Title"},{{"My Sales Report"}})
in
Source

…and a query called FruitSalesOverview that passes the data from the FruitSales query to the FabricAI.Prompt M function to generate a text summary of it:

let
Source = FabricAI.Prompt("Summarise the fruit sales data in 20 words or less", FruitSales),
ToTable = #table({"Summary"}, {{Source}})
in
ToTable

The last query I created, called Output, generated a navigation table in the format described in the docs to describe the Excel output: the report title in a range starting in cell B1, the report summary in a range starting in cell B3, the sales data in a table starting in cell B5 and a bar chart showing the sales data.

let
excelDocument = #table(
type table [
Sheet = nullable text,
Name = nullable text,
PartType = nullable text,
Properties = nullable record,
Data = any
],
{
// Report title
{"Sales", "Title", "Range", [StartCell = "B1", SkipHeader = true], ReportTitle},
// Copilot-generated summary
{"Sales", "Summary", "Range", [StartCell = "B3", SkipHeader = true], FruitSalesOverview},
// Table containing sales data
{
"Sales",
"SalesTable",
"Table",
[StartCell = "B5", TableStyle = "TableStyleMedium9"],
FruitSales
},
//Column chart containing sales data
{
"Sales",
"SalesChart",
"Chart",
[
ChartType = "Column",
ChartTitle = "Fruit Sales",
DataSeries = [AxisColumns = {"Product"}, ValueColumns = {"Sales"}]
],
#table({}, {}) meta [Name = "SalesTable"]
}
}
)
in
excelDocument

I then set the Data Destination of the Output query to use the New File option to create an Excel file and save it to the Files section of a Fabric Lakehouse. The use of the Advanced format option meant that the navigation table returned by the Output query was used to determine the structure of the resulting Excel file.

After refreshing the Dataflow I downloaded the resulting Excel workbook from my Lakehouse. Here’s what it looked like:

Pretty fun. Does it give you full control over the format of the Excel file? No, not quite. Is it a somewhat code-heavy approach? Yes, but I suppose in the age of AI that doesn’t matter so much since you’re unlikely to write your own code (although, being old-school, I adapted the code above from the docs manually). Most importantly: is this a better way of dumping data to Excel and/or generating simple Excel reports in Fabric than paginated reports? Good question, especially since Power Query is now available in paginated reports. I suspect the answer is that although paginated reports are harder to build (though you can generate rdl with AI too, I’ve done it) and that it’s harder to control what a paginated report rendered as an Excel file looks like, paginated reports may still have the edge if you want an Excel report but I’m not sure – factors like CU cost and how long it takes to generate an Excel file using each approach would also need to be taken into account. If you just want to dump data to Excel, however, Dataflows Gen2 are probably a better option now.

Query Folding And Staging In Fabric Dataflows Gen2

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However one thing I never got round to doing until this week is looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder to do than you might think.

Consider the following simple Dataflow consisting of two queries:

The first query, called StageData, reads data from a 6GB, 17 million row CSV file containing open data from the English Prescribing Data dataset. It returns two columns called PRACTICE_NAME and TOTAL_QUANTITY from that CSV file:

Staging is enabled on this query:

The second query, called GroupBy, takes the data returned by the StageData query and does a Group By operation to get the sum of the values in the TOTAL_QUANTITY column for each distinct value in PRACTICE_NAME:

The output of this query was loaded to a Fabric Warehouse:

The scenario is basically the same as the one from my previous post but with a much larger data volume, and the idea was to test again whether it was faster to stage the data and do the Group By on the staged data or to not stage the data and do the Group By while reading the data direct from the source.

It turned out that, once again, staging made performance worse (don’t worry, I have other tests that show it can help performance). But the point about staging is that by loading the data from a query into a hidden Fabric Lakehouse, managed by the Dataflow (which is what is meant by “staging”), any subsequent operations on this data are faster because query folding can take place against the SQL Endpoint of this hidden Lakehouse – and at the time of writing this post there’s no way of knowing from the Dataflows Editor whether query folding is taking place. Right-clicking on the step that does the Group By operation shows that the “View data source query” option is greyed out but this only tells you that you the Editor doesn’t know if folding is taking place:

In fact other things in the UI, such as the query plan and the query folding indicators, suggest incorrectly that folding is not taking place:

So I thought: if query folding is taking place then the Group By will result in a SQL query run against the hidden Lakehouse, so maybe I can see this SQL query somewhere? Unfortunately since the Lakehouse is hidden you can’t get to it through the Fabric web interface. But then I remembered that you can connect to a Fabric workspace using good old SQL Server Management Studio (instructions on how to can be found here). And when I connected using SSMS I could see two hidden objects created by by Dataflow called StagingLakehouseForDataflows and StagingWarehouseForDataflows:

I was then able to run a SQL query using the queryinsights.exec_requests_history DMV against StagingWarehouseForDataflows, filtered for the time range when my Dataflow refresh was taking place:

SELECT start_time,statement_type, command, total_elapsed_time_ms
FROM
queryinsights.exec_requests_history
WHERE
start_time>'insert DF refresh start time here'
AND end_time<'insert DF refresh end time here'
ORDER BY start_time desc

…and saw the following INSERT INTO statement that did the Group By operation I was expecting to see along with how long it took:

insert into
[StagingWarehouseForDataflows_20260223144925].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_2930fd3c_002D2a62_002D4518_002Dafbf_002D249e7af54403]
([Column1], [Column2])
select [_].[Column1] as [Column1],
convert(float, [_].[Total Quantity]) as [Column2]
from (
select [rows].[Column1] as [Column1],
sum([rows].[Column2]) as [Total Quantity]
from [StagingLakehouseForDataflows_20260223144911].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_179186be_002D367d_002D4924_002Da8ba_002Dd1f220415e3a]
as [rows]
group by [Column1] )
as [_]

So, a useful tip if you’re performance tuning a Dataflow even if it’s a bit of a pain to do. Hopefully in the future we’ll be able to see the SQL generated when query folding takes place against a staged table.

[Thanks to Miguel Escobar for his help with this]

A Closer Look At Preview-Only Steps In Fabric Dataflows

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.

The more I thought about this feature, though, the more I wondered about how it actually works and how it can be used in more complex queries that involve hand-written M code – so I decided to do a few tests. First of all, consider the following M query:

let
Step1 = 1,
Step2 = Step1 - 1
in
Step2

This query returns a numeric value: 0.

At this point it will return the same value in the editor when viewing the preview of the output and when the Dataflow refreshes. But if you right-click on Step2 and select “Enable only in previews” then the query still returns 0 in the preview but will return 1 when the Dataflow refreshes. You will also see this message displayed in the preview pane:

This data preview uses preview-only steps.
Results may differ when running the dataflow.

This makes sense because the query follows a linear pattern: the output references Step2 which in turn references Step1 so if you disable Step2 then the the output simply skips it and returns the value of Step1.

But what if your M code is more complex? For example consider this query:

let
x = 5,
y = 8,
xtimesy = x*y,
outputtable = #table(type table[xy=number],{{xtimesy}})
in
outputtable

Here’s what this returns in the editor: a table that contains the value 40.

What if you disable the step called “y”?

Here’s what I saw in my Warehouse when I loaded the output of the query to it:

I had no idea what the output would be before I saw it but, thinking about it, I assume what has happened is that since the step “y” has been disabled, “y” simply returns the value of the previous step in the query, “x”, and therefore the output becomes 5*5=25. This understanding of how preview-only steps work is backed-up by the fact that it is not possible to set step “x” to be preview-only. The option to do so is greyed out:

This all leads on to a pattern I used in a Dataflow this week and which I can see myself using a lot more in the future. While being able to disable steps when the Dataflow refreshes is very useful, sometimes what you need is to write some conditional logic in your M code that does one thing if you’re in the editor and another thing if the Dataflow is refreshing. Here’s a slightly modified version of the first query above which I have called IsRefresh in my Dataflow:

let
RefreshValue = 1,
PreviewValue = RefreshValue-1
in
PreviewValue

With the PreviewValue step set to be preview-only:

…then what we have here is a query that returns 1 when the Dataflow is refreshing and 0 when you’re viewing its output in the editor. Here’s an example of how it can be used in a query:

let
Source = #table(
type table [
NumericValue = number,
TextValue = text
],
{
{
IsRefresh,
if IsRefresh = 1 then
"This is a refresh"
else
"This is a preview"
}
}
)
in
Source

This query returns the following when you preview the output in the editor:

But when the Dataflow refreshes, the output in the destination is this:

Report On SAP And Salesforce Data In Fabric With Business Process Solutions

If you want to build a reporting solution on SAP (S/4HANA or ECC) or Salesforce data in Fabric and don’t want to build everything from scratch then you should check out Business Process Solutions. It’s a free, Microsoft-developed solution currently in public preview; the announcement blog post from last year is here and you can find all the docs here. It’s implemented as a Fabric custom workload which means that you can deploy it to a new workspace easily with just a few clicks, although there is of course a bit of configuration needed so it can connect to your data sources.

After you’ve done that it will generate all the Fabric items (pipelines, semantic models, Power BI reports etc) needed and you can concentrate on analysing your data. I know the team that is building Business Process Solutions and they are very smart so I’m sure what they’ve built is well designed.Check it out!

Monitor Fabric Costs With Fabric Cost Analysis

Following on from my blog post a few months ago about cool stuff in the Fabric Toolbox, there is now another really useful solution available there that anyone with Fabric capacities should check out: Fabric Cost Analysis (or FCA). If you have Fabric capacities it’s important to be able to monitor your Azure costs relating to them, so why not monitor your Fabric costs using a solution built using Fabric itself? This is what the folks behind FCA (who include Romain Casteres, author of this very useful blog post on FinOps for Fabric, plus Cédric Dupui, Manel Omani and Antoine Richet) decided to build and share freely with the community.

A lot of enhancements are planned for the future – I’m told there’s another big release planned for the end of October 2025 – but it’s already quite mature. Not only do you get all the ETL to extract cost data from Azure and Power BI reports you need but there’s also a Fabric Data Agent so you can query your data in natural language!

Aren’t Fabric costs straightforward? Don’t you just pay a flat fee for each capacity? In a lot of cases yes, but there can be complications: you might be pausing and resuming capacities or scaling them up or down. There can also be other costs you might not realise you’re incurring. For example if you’ve ever read any of Matthew Farrow’s excellent content on Fabric costs, such as this post on the cost implications of pausing a capacity, and wondered whether you’ve been charged extra for pausing a throttled capacity, then FCA can answer that question:

Like FUAM this is not an official Microsoft product but a solution accelerator with no official support, but it’s definitely a much better option than building something yourself or not monitoring your costs at all.