Improve Performance Of Staging In Fabric Dataflows Gen2 By Disabling V-Order

Quite a few new Dataflows Gen2 features were released recently without much fanfare, but that doesn’t mean they aren’t important. I will take a look at them all in my next few posts; in this first post I’ll look at the ability to disable V-Order on staged data.

As the (very detailed) documentation for this new feature describes, V-Order is a write-time optimisation for the parquet files that underpin the Delta tables that OneLake uses to store data. It slows down writing data to the tables but means that reading data from them, for example in Power BI Direct Lake mode, is much faster. It used to be the case that when you staged data inside a dataflow that data always had V-Order applied; now you have the option to disable V-Order. Disabling V-Order makes staging faster and because staged data is rarely queried more than a few times, disabling V-Order usually improves overall refresh performance.

To test this I created a simple dataflow that connected to a large (5.58GB) CSV file that contained 17.6 million rows of data, staged the data in a query called StageData, then did a group by on that data in a second query called GroupBy.

I turned off Fast Copy and left the “Enable V-Order compression” setting on:

[At the time of writing this post the ability to disable V-Order only works when Fast Copy is not used – I expect this to change in the future]

I refreshed the dataflow and it took 1 minute 59 seconds. The StageData query (where the staging takes place) took 1 minute 31 seconds; the GroupBy query took 12 seconds.

I then disabled V-Order compression for staging:

…and refreshed again. This time overall refresh took 1 minute 32 seconds, the StageData query took 1 minute 13 seconds and the GroupBy query took 7 seconds. While there is always a certain amount of variation in dataflow refresh timings it’s clear that disabling V-Order resulting in staging being about 20 seconds faster with no reduction in performance of the group by transformation on the staged data. So, in this case at least, disabling V-Order was a good thing for refresh performance.

When you decide whether to use staging in a dataflow you have to test to see whether the extra time needed to stage the data is worth it compared to the performance improvements you get by doing transformations on the staged data (which mostly come from those transformations having the opportunity to be folded). Since turning off V-Order makes staging faster it means that staging is more useful and will result in better overall dataflow refresh performance more often.

Fabric Dataflows Gen2 And Concurrent Evaluation, Part 3: Is More Parallelism Always Good?

To finish off my series of posts on concurrent evaluation in Fabric Dataflows Gen2 (see part 1 and part 2) I decided to do some more realistic tests to see how much parallelism I could get. To do this I uploaded 244 identical Excel files containing almost 542000 rows of data each to a SharePoint document library. Excel files are probably the worst-performing file format for dataflows (see here for some tests that show this), while SharePoint is probably the worst-performing place to store data for a dataflow and also has a reputation for throttling applications that make too many requests.

As a baseline I created a Dataflow Gen2 that combined all the data from all the Excel files in the document library and loaded it into a single table in a Fabric Warehouse; I configured this dataflow up so it did not use Partitioned Compute.

Looking at the Recent Runs data for this dataflow I could see that the query that combined the data from all the Excel files took on average 28 minutes 8 seconds to run.

I then created a copy of this dataflow and modified it so that it did use Partitioned Compute. I set the column containing the Excel filename as the partition key which meant that, in theory, the dataflow could read the data from each of the files in parallel. I then refreshed the dataflow with different maximum concurrency settings (see the previous post in this series for how to do this). Here are the results:

Maximum ConcurrencyAverage Refresh Duration (Minutes:Seconds)
418:37
89:42
165:24
2505:21
No limit set5:16

A few interesting things to note:

  • Refreshes with a maximum concurrency of 4 were significantly faster than the baseline although nowhere near four times faster, possibly because of the overhead introduced by Partitioned Compute, but
  • Refreshes with a maximum concurrency of 8 were around twice as fast as refreshes with a maximum concurrency of 4, and refreshes with a maximum concurrency of 16 were almost four times as fast
  • There was no performance gain achieved by setting a maximum concurrency of more than 16
  • Setting a limit of more than 16 resulted in the same performance as not setting a limit at all

One last thing to mention is that one of the refreshes when no limit was set failed after two minutes. It was difficult to tell why it failed but after downloading the logs and looking through them it seems that SharePoint was returning a 429 Too Many Requests error, which makes sense given that the dataflow was trying to read data from a lot of Excel files at the same time. Given that no data source, especially SharePoint, can be expected to handle an infinite number of parallel requests for data from a dataflow then this is a good example of where too much parallelism can be a bad thing and cause errors, and where it’s important to limit the amount of parallelism inside the dataflow.

Fabric Dataflows Gen2 And Concurrent Evaluation, Part 2: Partitioned Compute

In the first part of this series I showed how the Concurrency setting in a Fabric Dataflows Gen2 can affect refresh performance when there are multiple queries inside the dataflow. In this post I will show how, with Partitioned Compute, this setting can also affect the performance of a single query within a dataflow.

To test this I created a dataflow with one query, a modified version of the query that I used in this post from earlier this year which returns a table with ten rows and calls a function with a built-in delay of 60 seconds on each row.

Refreshing the dataflow with Partitioned Compute enabled and with the default concurrency settings:

Refreshing this dataflow took 1 minute 47 seconds:

Looking at the details for just the query with the delay, I could see that it took 1 minute 19 seconds – which suggests that every row was evaluated in parallel:

I don’t know if a concurrency of ten is something that will be achievable every time; based on what I saw in the tests from my previous posts the amount of concurrency with the default setting is variable.

I then changed the Concurrency setting to 1:

And then refreshed the dataflow again. This time the refresh took 11 minutes 31 seconds:

Looking at the details for the query with the delay I could see that it took 11 minutes 9 seconds:

This suggests that each row in the query was evaluated one after the other and shows that the Concurrency setting can be used to control the number of concurrent evaluations within a single query. As Partitioned Compute gets introduced in more and more scenarios in Fabric Dataflows Gen2 being able to control concurrency and tune performance will be more and more important.

Fabric Dataflows Gen2 And Concurrent Evaluation, Part 1

Did you know that if your Fabric Dataflows Gen2 contains several queries then you can control how many of them are evaluated in parallel when your dataflow refreshes? In this series I’ll look at how how you can do this and how it may result in better performance – at least in some cases.

Let’s start with the basics. I created a Dataflow Gen2 with ten queries which each returned a table of one row and one column after one minute. I used the #table function to generate the table without connecting to a data source, code from this post to add the delay and the trick in this post to make sure the delay was only applied when the dataflow refreshed. The output of each query was loaded to a Fabric Warehouse.

I then refreshed the dataflow using the default settings and found that the refresh took 2 minutes 9 seconds by looking in Recent Runs; each individual query took somewhere between 1 minute 10 seconds and 1 minute 30 seconds, which matches the 10-30 second overhead on query execution that I normally see when tuning dataflows. I refreshed it two more times and the durations were 1 minute 32 seconds and 3 minutes 38 seconds. This all suggests that the amount of concurrency was variable, and that the dataflow was sometimes able to evaluate all the queries in parallel and sometimes only able to evaluate some of them in parallel.

I then opened the dataflow, went to the Options dialog and the Scale pane, checked the “Limit number of concurrent evaluations” box and set the slider to 1:

I then refreshed the dataflow again and this time the overall refresh took 12 minutes 29 seconds, which is consistent with ten query evaluations that took 1 minute 10 seconds to 1 minute 30 seconds running one after the other.

Finally, I went back into the dataflow and set the Concurrency slider as high as possible to 250:

I then refreshed the dataflow again and this time the refresh took 1 minute 33 seconds, suggesting all the queries were evaluated in parallel.

So from all this we can see that, as you might expect, increasing the amount of concurrency improved dataflow refresh performance. There’s more to learn about concurrency though and as we shall learn later in this series, more concurrency isn’t always better.

Generating Sample Data In Fabric Dataflows With FabricAI.Prompt()

Back in December the FabricAI.Prompt() M function was released in Fabric Dataflows Gen2. Most of the people writing about it at that time, as in this great post by my colleague Sandeep Pawar, focused on calling this function for each row in a table – something that the UI in the editor makes easy. However the FabricAI.Prompt() function itself is a lot more flexible. You can use it to summarise whole tables of data as I showed here; you can also use it to generate sample data. This is similar to what I blogged about here where I got Copilot to generate M code that returned sample data but using FabricAI.Prompt() is maybe a bit simpler.

The trick is to get FabricAI.Prompt() to generate a table of data in CSV format. Here’s an example prompt:

Generate a table of sample sales data in CSV format.
The table should have three columns called Country, Product and Sales.
The Country column should contain the names of random European countries.
The Product column should contain the names of random types of fruit.
The Sales column should contain random numbers between 1 and 100.
The table should contain 10 rows.
The first row of text returned should contain the column names.
Subsequent rows should contain the data.

…and here’s how this prompt can be used with Fabric.AIPrompt() and how the text that the function returns can be turned into a table using the CSV.Document function:

let
Source = FabricAI.Prompt(
"Generate a table of sample sales data in CSV format. The table should have three columns called Country, Product and Sales. The Country column should contain the names of random European countries. The Product column should contain the names of random types of fruit. The Sales column should contain random numbers between 1 and 100. The table should contain 10 rows. The first row of text returned should contain the column names. Subsequent rows should contain the data."
),
ToCSV = Csv.Document(Source),
#"Promoted headers" = Table.PromoteHeaders(
ToCSV,
[PromoteAllScalars = true]
),
#"Changed column type"
= Table.TransformColumnTypes(
#"Promoted headers",
{
{"Country", type text},
{"Product", type text},
{"Sales", Int64.Type}
}
)
in
#"Changed column type"

And here’s an example of the result:

Generating Excel Reports With Fabric Dataflows Gen2

So many cool Fabric features get announced at Fabcon that it’s easy to miss some of them. The fact that you can now not only generate Excel files from Fabric Dataflows Gen2, but that you have so much control over the format that you can use this feature to build simple reports rather than plain old data dumps, is a great example: it was only mentioned halfway through this blog post on new stuff in Dataflows Gen2 Nonethless it was the Fabcon feature announcement that got me most excited. This is because it shows how Fabric Dataflows Gen2 have gone beyond being just a way to bring data into Fabric and are now a proper self-service ETL tool where you can extract data from a lot of different sources, transform it using Power Query, and load it to a variety of destinations both inside Fabric and outside it (such as CSV files, Snowflake and yes, Excel).

The documentation for the new Excel destination, which you can find here, is extremely detailed indeed so I thought it would be useful to show a simple example of how you can now use Dataflows Gen2 to build an Excel report. First of all I created a query using the Enter Data source that returned a table with some sales data in:

let
Source = Table.FromRows(
Json.Document(
Binary.Decompress(
Binary.FromText(
"i45WciwoyEktVtJRMlWK1YlW8i9KzEsH8w0NwAIBqYlFYK4RmOtelFgAkbZUio0FAA==",
BinaryEncoding.Base64
),
Compression.Deflate
)
),
let
_t = ((type nullable text) meta [Serialized.Text = true])
in
type table [Product = _t, Sales = _t]
),
#"Changed column type" = Table.TransformColumnTypes(
Source,
{{"Product", type text}, {"Sales", Int64.Type}}
)
in
#"Changed column type"

I then created a query called ReportTitle that contained the text for my report’s title:

let
Source = #table({"Title"},{{"My Sales Report"}})
in
Source

…and a query called FruitSalesOverview that passes the data from the FruitSales query to the FabricAI.Prompt M function to generate a text summary of it:

let
Source = FabricAI.Prompt("Summarise the fruit sales data in 20 words or less", FruitSales),
ToTable = #table({"Summary"}, {{Source}})
in
ToTable

The last query I created, called Output, generated a navigation table in the format described in the docs to describe the Excel output: the report title in a range starting in cell B1, the report summary in a range starting in cell B3, the sales data in a table starting in cell B5 and a bar chart showing the sales data.

let
excelDocument = #table(
type table [
Sheet = nullable text,
Name = nullable text,
PartType = nullable text,
Properties = nullable record,
Data = any
],
{
// Report title
{"Sales", "Title", "Range", [StartCell = "B1", SkipHeader = true], ReportTitle},
// Copilot-generated summary
{"Sales", "Summary", "Range", [StartCell = "B3", SkipHeader = true], FruitSalesOverview},
// Table containing sales data
{
"Sales",
"SalesTable",
"Table",
[StartCell = "B5", TableStyle = "TableStyleMedium9"],
FruitSales
},
//Column chart containing sales data
{
"Sales",
"SalesChart",
"Chart",
[
ChartType = "Column",
ChartTitle = "Fruit Sales",
DataSeries = [AxisColumns = {"Product"}, ValueColumns = {"Sales"}]
],
#table({}, {}) meta [Name = "SalesTable"]
}
}
)
in
excelDocument

I then set the Data Destination of the Output query to use the New File option to create an Excel file and save it to the Files section of a Fabric Lakehouse. The use of the Advanced format option meant that the navigation table returned by the Output query was used to determine the structure of the resulting Excel file.

After refreshing the Dataflow I downloaded the resulting Excel workbook from my Lakehouse. Here’s what it looked like:

Pretty fun. Does it give you full control over the format of the Excel file? No, not quite. Is it a somewhat code-heavy approach? Yes, but I suppose in the age of AI that doesn’t matter so much since you’re unlikely to write your own code (although, being old-school, I adapted the code above from the docs manually). Most importantly: is this a better way of dumping data to Excel and/or generating simple Excel reports in Fabric than paginated reports? Good question, especially since Power Query is now available in paginated reports. I suspect the answer is that although paginated reports are harder to build (though you can generate rdl with AI too, I’ve done it) and that it’s harder to control what a paginated report rendered as an Excel file looks like, paginated reports may still have the edge if you want an Excel report but I’m not sure – factors like CU cost and how long it takes to generate an Excel file using each approach would also need to be taken into account. If you just want to dump data to Excel, however, Dataflows Gen2 are probably a better option now.

Query Folding And Staging In Fabric Dataflows Gen2

A few years ago I wrote this post on the subject of staging in Fabric Dataflows Gen2. In it I explained what staging is, how you can enable it for a query inside a Dataflow, and discussed the pros and cons of using it. However one thing I never got round to doing until this week is looking at how you can tell if query folding is happening on staged data inside a Dataflow – which turns out to be harder to do than you might think.

Consider the following simple Dataflow consisting of two queries:

The first query, called StageData, reads data from a 6GB, 17 million row CSV file containing open data from the English Prescribing Data dataset. It returns two columns called PRACTICE_NAME and TOTAL_QUANTITY from that CSV file:

Staging is enabled on this query:

The second query, called GroupBy, takes the data returned by the StageData query and does a Group By operation to get the sum of the values in the TOTAL_QUANTITY column for each distinct value in PRACTICE_NAME:

The output of this query was loaded to a Fabric Warehouse:

The scenario is basically the same as the one from my previous post but with a much larger data volume, and the idea was to test again whether it was faster to stage the data and do the Group By on the staged data or to not stage the data and do the Group By while reading the data direct from the source.

It turned out that, once again, staging made performance worse (don’t worry, I have other tests that show it can help performance). But the point about staging is that by loading the data from a query into a hidden Fabric Lakehouse, managed by the Dataflow (which is what is meant by “staging”), any subsequent operations on this data are faster because query folding can take place against the SQL Endpoint of this hidden Lakehouse – and at the time of writing this post there’s no way of knowing from the Dataflows Editor whether query folding is taking place. Right-clicking on the step that does the Group By operation shows that the “View data source query” option is greyed out but this only tells you that you the Editor doesn’t know if folding is taking place:

In fact other things in the UI, such as the query plan and the query folding indicators, suggest incorrectly that folding is not taking place:

So I thought: if query folding is taking place then the Group By will result in a SQL query run against the hidden Lakehouse, so maybe I can see this SQL query somewhere? Unfortunately since the Lakehouse is hidden you can’t get to it through the Fabric web interface. But then I remembered that you can connect to a Fabric workspace using good old SQL Server Management Studio (instructions on how to can be found here). And when I connected using SSMS I could see two hidden objects created by by Dataflow called StagingLakehouseForDataflows and StagingWarehouseForDataflows:

I was then able to run a SQL query using the queryinsights.exec_requests_history DMV against StagingWarehouseForDataflows, filtered for the time range when my Dataflow refresh was taking place:

SELECT start_time,statement_type, command, total_elapsed_time_ms
FROM
queryinsights.exec_requests_history
WHERE
start_time>'insert DF refresh start time here'
AND end_time<'insert DF refresh end time here'
ORDER BY start_time desc

…and saw the following INSERT INTO statement that did the Group By operation I was expecting to see along with how long it took:

insert into
[StagingWarehouseForDataflows_20260223144925].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_2930fd3c_002D2a62_002D4518_002Dafbf_002D249e7af54403]
([Column1], [Column2])
select [_].[Column1] as [Column1],
convert(float, [_].[Total Quantity]) as [Column2]
from (
select [rows].[Column1] as [Column1],
sum([rows].[Column2]) as [Total Quantity]
from [StagingLakehouseForDataflows_20260223144911].[dbo].[1847c1263c7d4318a91dd6cd73ce48c6_179186be_002D367d_002D4924_002Da8ba_002Dd1f220415e3a]
as [rows]
group by [Column1] )
as [_]

So, a useful tip if you’re performance tuning a Dataflow even if it’s a bit of a pain to do. Hopefully in the future we’ll be able to see the SQL generated when query folding takes place against a staged table.

[Thanks to Miguel Escobar for his help with this]

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

For my tests I created two queries within a single Dataflow Gen2 CICD. First, an M function called SlowFunction that takes a numeric value and returns that value with 1 added to it after a two second delay:

(input as number) as number =>
Function.InvokeAfter(()=>input+1, #duration(0,0,0,2))

Then the main query which returns a table of ten rows and calls the SlowFunction M function once per row:

let
Rows = List.Transform({1 .. 10}, each {_}),
MyTable = #table(type table [RowNumber = number], Rows),
#"Added custom" = Table.TransformColumnTypes(
Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
{{"FunctionOutput", Int64.Type}}
)
in
#"Added custom"

Here’s what the output of the query looks like:

Now, the first important question. How long does this Dataflow take to run? 10 rows calling a function that takes 2 seconds to run, so 10*2=20 seconds maybe? The answer is yes if you look at how long the preview takes to populate in the Dataflow Editor:

That’s just the preview though. If you’re refreshing a Dataflow there are other things that happen that affect how long that refresh takes, such as Staging and loading the data to a destination. There’s no way you can split the performance of your M code from these operations when looking at the duration of a Dataflow refresh in Recent Runs, which explains why some of the timings you will see later in this post seem strange. Don’t worry, though, it doesn’t stop you from seeing the important trends. I’m told that setting a CSV file in a Lakehouse as your data destination is the best way of minimising the impact of loading data on overall refresh durations but at the time of writing the CSV destination can’t be used with Partitioned Compute so all my tests used a Fabric Warehouse as a destination.

Here’s what Recent Runs showed when this Dataflow was refreshed:

The overall refresh time was 59 seconds; the query (called NonPartitioned here) that returns the table of ten rows and which was staged took 29 seconds.

Could this Dataflow benefit from Partitioned Compute? With Partitioned Compute enabled in the Options dialog, I added the necessary M code to the query:

let
Rows = List.Transform({1 .. 10}, each {_}),
MyTable = #table(type table [RowNumber = number], Rows),
#"Added custom" = Table.TransformColumnTypes(
Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])),
{{"FunctionOutput", Int64.Type}}
),
ReplacePartitionKey = Table.ReplacePartitionKey(#"Added custom", {"RowNumber"})
in
ReplacePartitionKey

…and then refreshed. Here’s what Recent Runs showed:

The overall refresh duration went up to 1 minute 25 seconds; the query that does all the work (called Partitioned in this case) took 40 seconds. Note the screenshot immediately above shows the engine used is “PartitionedCompute” and that there are now ten activities listed instead of one: my M code used the RowNumber column in the table as the partition key so the Dataflow attempted to run each row of the table as a separate operation in parallel. And as you can see, this made performance worse. This is because using Partitioned Compute introduces yet another overhead and that overhead is much greater than any benefit gained from parallelism in this case.

So I wondered: what if the delay in the query is increased from 2 second to 100 seconds then? Does this increase in the delay mean that parallelism results in faster overall performance?

Here’s what Recent Runs showed for a version of my Dataflow with a 100 second delay for each row and which didn’t use Partitioned Compute:

10 rows * 100 seconds = 1000 seconds = 16 minutes 40 seconds, so it’s not surprising that the overall duration of this version of the Dataflow was slow at 17 minutes 29 seconds.

Here’s what Recent Runs shows for the version of this Dataflow that did use Partitioned Compute:

The overall duration was 4 minutes 41 seconds and the main query took 3 minutes 14 seconds. The important takeaway is that this is a lot faster than the version that didn’t use Partitioned Compute, so clearly Partitioned Compute made a big difference to performance here. As you might expect, it looks like parallelising operations that only take a few seconds results in worse performance while parallelising operations that take longer, say a minute or more, is probably a good idea. As always, you’ll need to test to see what benefits you get for your Dataflows.

These results raise a lot of questions too. 100 seconds = 1 minute 40 seconds, which is a lot less than 3 minutes 14 seconds. Does this mean that not every row in the table was evaluated in parallel? Is partitioning on the RowNumber column counter-productive and would it be better to partition in some other way to try to reduce the amount of attempted parallelism? Is there something else that is limiting the amount of parallelism? While this version of the Dataflow always performs better than the non-partitioned version, performance did vary a lot between refreshes. While these tests show how useful Partitioned Compute can be for slow Dataflows, there’s a lot more research to do and a lot more blog posts to write.

A Closer Look At Preview-Only Steps In Fabric Dataflows

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.

The more I thought about this feature, though, the more I wondered about how it actually works and how it can be used in more complex queries that involve hand-written M code – so I decided to do a few tests. First of all, consider the following M query:

let
Step1 = 1,
Step2 = Step1 - 1
in
Step2

This query returns a numeric value: 0.

At this point it will return the same value in the editor when viewing the preview of the output and when the Dataflow refreshes. But if you right-click on Step2 and select “Enable only in previews” then the query still returns 0 in the preview but will return 1 when the Dataflow refreshes. You will also see this message displayed in the preview pane:

This data preview uses preview-only steps.
Results may differ when running the dataflow.

This makes sense because the query follows a linear pattern: the output references Step2 which in turn references Step1 so if you disable Step2 then the the output simply skips it and returns the value of Step1.

But what if your M code is more complex? For example consider this query:

let
x = 5,
y = 8,
xtimesy = x*y,
outputtable = #table(type table[xy=number],{{xtimesy}})
in
outputtable

Here’s what this returns in the editor: a table that contains the value 40.

What if you disable the step called “y”?

Here’s what I saw in my Warehouse when I loaded the output of the query to it:

I had no idea what the output would be before I saw it but, thinking about it, I assume what has happened is that since the step “y” has been disabled, “y” simply returns the value of the previous step in the query, “x”, and therefore the output becomes 5*5=25. This understanding of how preview-only steps work is backed-up by the fact that it is not possible to set step “x” to be preview-only. The option to do so is greyed out:

This all leads on to a pattern I used in a Dataflow this week and which I can see myself using a lot more in the future. While being able to disable steps when the Dataflow refreshes is very useful, sometimes what you need is to write some conditional logic in your M code that does one thing if you’re in the editor and another thing if the Dataflow is refreshing. Here’s a slightly modified version of the first query above which I have called IsRefresh in my Dataflow:

let
RefreshValue = 1,
PreviewValue = RefreshValue-1
in
PreviewValue

With the PreviewValue step set to be preview-only:

…then what we have here is a query that returns 1 when the Dataflow is refreshing and 0 when you’re viewing its output in the editor. Here’s an example of how it can be used in a query:

let
Source = #table(
type table [
NumericValue = number,
TextValue = text
],
{
{
IsRefresh,
if IsRefresh = 1 then
"This is a refresh"
else
"This is a preview"
}
}
)
in
Source

This query returns the following when you preview the output in the editor:

But when the Dataflow refreshes, the output in the destination is this:

Generating Sample Data With Copilot In Fabric Dataflows Gen2

As a Power Query fan I’m naturally interested in Copilot in Dataflows Gen2. But since Power Query is already very easy to use and since I’m pretty experienced at using it, can Copilot really do anything to help me be more productive? Well one thing I need to do a lot for blog posts, demos and testing is to generate tables of sample data. Typing data into the Enter Data source can be quite tedious and writing the M code to generate sample data isn’t straightforward, so I was wondering if Copilot in Dataflows Gen2 could do it for me. It turns out it can do so really well.

After a little bit of experimentation I’ve come up with an example prompt that can be easily modified to return a table of random values with columns of different data types and a given number of rows, and which works fairly reliably:

Return a table consisting of the following five columns: 
Country, containing one of 10 random country names; 
Product, containing one of 15 random fruit names; 
Person, containing one of 20 random first names that are common in the United Kingdom;
Date, containing random dates between January 1st 2019 and December 31st 2024;
Sales, containing random integers between 0 and 500. 
The table should have 1000 rows. Set an appropriate data type for each column.

Here’s an example of the M code generated:

let
    // Generate lists of random values
    Countries = {"United States", "Canada", "United Kingdom", "Germany", "France", "Italy", "Spain", "Australia", "India", "China"},
    Products = {"Apple", "Banana", "Cherry", "Date", "Elderberry", "Fig", "Grape", "Honeydew", "Kiwi", "Lemon", "Mango", "Nectarine", "Orange", "Papaya", "Quince"},
    People = {"James", "Mary", "John", "Patricia", "Robert", "Jennifer", "Michael", "Linda", "William", "Elizabeth", "David", "Barbara", "Richard", "Susan", "Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen"},
    StartDate = #date(2019, 1, 1),
    EndDate = #date(2024, 12, 31),
    DaysBetween = Duration.Days(EndDate - StartDate) + 1,

    // Function to generate random dates
    RandomDate = () => Date.From(StartDate + #duration(Number.RoundDown(Number.RandomBetween(0, DaysBetween)), 0, 0, 0)),

    // Generate table records
    Records = List.Transform({1..1000}, each [
        Country = Countries{Number.Mod(_, List.Count(Countries))},
        Product = Products{Number.Mod(_, List.Count(Products))},
        Person = People{Number.Mod(_, List.Count(People))},
        Date = RandomDate(),
        Sales = Number.RoundDown(Number.RandomBetween(0, 500))
    ]),

    // Convert list of records to table and set data types
    Source = Table.FromRecords(Records),
    TypedTable = Table.TransformColumnTypes(Source, {
        {"Country", type text},
        {"Product", type text},
        {"Person", type text},
        {"Date", type date},
        {"Sales", Int32.Type}
    })
in
    TypedTable

And here’s an example of the output:

Definitely a time saver as far as I’m concerned. Is it totally reliable? No: it occasionally produces code that errors or which doesn’t contain genuinely random values, but it’s good enough and it’s faster to try the prompt once or twice than write the code myself. I know there are other, more sophisticated ways of generating sample data like this in Fabric, for example in Python, but as I said I’m a Power Query person.

And of course, for bonus points, we can now send the output of a Dataflow Gen2 to a CSV file in SharePoint which makes this even more useful: