Generating Sample Data In Fabric Dataflows With FabricAI.Prompt()

Back in December the FabricAI.Prompt() M function was released in Fabric Dataflows Gen2. Most of the people writing about it at that time, as in this great post by my colleague Sandeep Pawar, focused on calling this function for each row in a table – something that the UI in the editor makes easy. However the FabricAI.Prompt() function itself is a lot more flexible. You can use it to summarise whole tables of data as I showed here; you can also use it to generate sample data. This is similar to what I blogged about here where I got Copilot to generate M code that returned sample data but using FabricAI.Prompt() is maybe a bit simpler.

The trick is to get FabricAI.Prompt() to generate a table of data in CSV format. Here’s an example prompt:

			
Generate a table of sample sales data in CSV format. 
The table should have three columns called Country, Product and Sales. 
The Country column should contain the names of random European countries. 
The Product column should contain the names of random types of fruit. 
The Sales column should contain random numbers between 1 and 100. 
The table should contain 10 rows. 
The first row of text returned should contain the column names. 
Subsequent rows should contain the data.

		

…and here’s how this prompt can be used with Fabric.AIPrompt() and how the text that the function returns can be turned into a table using the CSV.Document function:

			
let
  Source = FabricAI.Prompt(
    "Generate a table of sample sales data in CSV format. The table should have three columns called Country, Product and Sales. The Country column should contain the names of random European countries. The Product column should contain the names of random types of fruit. The Sales column should contain random numbers between 1 and 100. The table should contain 10 rows. The first row of text returned should contain the column names. Subsequent rows should contain the data."
  ), 
  ToCSV = Csv.Document(Source), 
  #"Promoted headers" = Table.PromoteHeaders(
    ToCSV, 
    [PromoteAllScalars = true]
  ), 
  #"Changed column type"
    = Table.TransformColumnTypes(
    #"Promoted headers", 
    {
      {"Country", type text}, 
      {"Product", type text}, 
      {"Sales", Int64.Type}
    }
  )
in
  #"Changed column type"

		

And here’s an example of the result:

Generating Excel Reports With Fabric Dataflows Gen2

So many cool Fabric features get announced at Fabcon that it’s easy to miss some of them. The fact that you can now not only generate Excel files from Fabric Dataflows Gen2, but that you have so much control over the format that you can use this feature to build simple reports rather than plain old data dumps, is a great example: it was only mentioned halfway through this blog post on new stuff in Dataflows Gen2 Nonethless it was the Fabcon feature announcement that got me most excited. This is because it shows how Fabric Dataflows Gen2 have gone beyond being just a way to bring data into Fabric and are now a proper self-service ETL tool where you can extract data from a lot of different sources, transform it using Power Query, and load it to a variety of destinations both inside Fabric and outside it (such as CSV files, Snowflake and yes, Excel).

The documentation for the new Excel destination, which you can find here, is extremely detailed indeed so I thought it would be useful to show a simple example of how you can now use Dataflows Gen2 to build an Excel report. First of all I created a query using the Enter Data source that returned a table with some sales data in:

			
let
  Source = Table.FromRows(
    Json.Document(
      Binary.Decompress(
        Binary.FromText(
          "i45WciwoyEktVtJRMlWK1YlW8i9KzEsH8w0NwAIBqYlFYK4RmOtelFgAkbZUio0FAA==",
          BinaryEncoding.Base64
        ),
        Compression.Deflate
      )
    ),
    let
      _t = ((type nullable text) meta [Serialized.Text = true])
    in
      type table [Product = _t, Sales = _t]
  ),
  #"Changed column type" = Table.TransformColumnTypes(
    Source,
    {{"Product", type text}, {"Sales", Int64.Type}}
  )
in
  #"Changed column type"

		

I then created a query called ReportTitle that contained the text for my report’s title:

			
let
  Source = #table({"Title"},{{"My Sales Report"}})
in
  Source

…and a query called FruitSalesOverview that passes the data from the FruitSales query to the FabricAI.Prompt M function to generate a text summary of it:

			
let
  Source  = FabricAI.Prompt("Summarise the fruit sales data in 20 words or less", FruitSales),
  ToTable = #table({"Summary"}, {{Source}})
in
  ToTable

		

The last query I created, called Output, generated a navigation table in the format described in the docs to describe the Excel output: the report title in a range starting in cell B1, the report summary in a range starting in cell B3, the sales data in a table starting in cell B5 and a bar chart showing the sales data.

			
let
  excelDocument = #table(
    type table [
      Sheet      = nullable text,
      Name       = nullable text,
      PartType   = nullable text,
      Properties = nullable record,
      Data       = any
    ],
    {
      // Report title
      {"Sales", "Title", "Range", [StartCell = "B1", SkipHeader = true], ReportTitle},
      // Copilot-generated summary
      {"Sales", "Summary", "Range", [StartCell = "B3", SkipHeader = true], FruitSalesOverview},
      // Table containing sales data
      {
        "Sales",
        "SalesTable",
        "Table",
        [StartCell = "B5", TableStyle = "TableStyleMedium9"],
        FruitSales
      },
      //Column chart containing sales data
      {
        "Sales",
        "SalesChart",
        "Chart",
        [
          ChartType  = "Column",
          ChartTitle = "Fruit Sales",
          DataSeries = [AxisColumns = {"Product"}, ValueColumns = {"Sales"}]
        ],
        #table({}, {}) meta [Name = "SalesTable"]
      }
    }
  )
in
  excelDocument

		

I then set the Data Destination of the Output query to use the New File option to create an Excel file and save it to the Files section of a Fabric Lakehouse. The use of the Advanced format option meant that the navigation table returned by the Output query was used to determine the structure of the resulting Excel file.

After refreshing the Dataflow I downloaded the resulting Excel workbook from my Lakehouse. Here’s what it looked like:

Pretty fun. Does it give you full control over the format of the Excel file? No, not quite. Is it a somewhat code-heavy approach? Yes, but I suppose in the age of AI that doesn’t matter so much since you’re unlikely to write your own code (although, being old-school, I adapted the code above from the docs manually). Most importantly: is this a better way of dumping data to Excel and/or generating simple Excel reports in Fabric than paginated reports? Good question, especially since Power Query is now available in paginated reports. I suspect the answer is that although paginated reports are harder to build (though you can generate rdl with AI too, I’ve done it) and that it’s harder to control what a paginated report rendered as an Excel file looks like, paginated reports may still have the edge if you want an Excel report but I’m not sure – factors like CU cost and how long it takes to generate an Excel file using each approach would also need to be taken into account. If you just want to dump data to Excel, however, Dataflows Gen2 are probably a better option now.

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Partitioned Compute is a new feature in Fabric Dataflows that allows you to run certain operations inside a Dataflow query in parallel and therefore improve performance. While UI support is limited at the moment it can be used in any Dataflow by adding a single line of fairly simple M code and checking a box in the Options dialog. But as with a lot of performance optimisation features (and this is particularly true of Dataflows) it can sometimes result in worse performance rather than better performance – you need to know how and when to use it. And so, in order to understand when this feature should and shouldn’t be used, I decided to do some tests and share the results here.

For my tests I created two queries within a single Dataflow Gen2 CICD. First, an M function called SlowFunction that takes a numeric value and returns that value with 1 added to it after a two second delay:

			
(input as number) as number => 
Function.InvokeAfter(()=>input+1, #duration(0,0,0,2))

Then the main query which returns a table of ten rows and calls the SlowFunction M function once per row:

			
let
  Rows = List.Transform({1 .. 10}, each {_}), 
  MyTable = #table(type table [RowNumber = number], Rows), 
  #"Added custom" = Table.TransformColumnTypes(
    Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])), 
    {{"FunctionOutput", Int64.Type}}
  )
in
  #"Added custom"

		

Here’s what the output of the query looks like:

Now, the first important question. How long does this Dataflow take to run? 10 rows calling a function that takes 2 seconds to run, so 10*2=20 seconds maybe? The answer is yes if you look at how long the preview takes to populate in the Dataflow Editor:

That’s just the preview though. If you’re refreshing a Dataflow there are other things that happen that affect how long that refresh takes, such as Staging and loading the data to a destination. There’s no way you can split the performance of your M code from these operations when looking at the duration of a Dataflow refresh in Recent Runs, which explains why some of the timings you will see later in this post seem strange. Don’t worry, though, it doesn’t stop you from seeing the important trends. I’m told that setting a CSV file in a Lakehouse as your data destination is the best way of minimising the impact of loading data on overall refresh durations but at the time of writing the CSV destination can’t be used with Partitioned Compute so all my tests used a Fabric Warehouse as a destination.

Here’s what Recent Runs showed when this Dataflow was refreshed:

The overall refresh time was 59 seconds; the query (called NonPartitioned here) that returns the table of ten rows and which was staged took 29 seconds.

Could this Dataflow benefit from Partitioned Compute? With Partitioned Compute enabled in the Options dialog, I added the necessary M code to the query:

			
let
  Rows = List.Transform({1 .. 10}, each {_}), 
  MyTable = #table(type table [RowNumber = number], Rows), 
  #"Added custom" = Table.TransformColumnTypes(
    Table.AddColumn(MyTable, "FunctionOutput", each SlowFunction([RowNumber])), 
    {{"FunctionOutput", Int64.Type}}
  ), 
  ReplacePartitionKey = Table.ReplacePartitionKey(#"Added custom", {"RowNumber"})
in
  ReplacePartitionKey

		

…and then refreshed. Here’s what Recent Runs showed:

The overall refresh duration went up to 1 minute 25 seconds; the query that does all the work (called Partitioned in this case) took 40 seconds. Note the screenshot immediately above shows the engine used is “PartitionedCompute” and that there are now ten activities listed instead of one: my M code used the RowNumber column in the table as the partition key so the Dataflow attempted to run each row of the table as a separate operation in parallel. And as you can see, this made performance worse. This is because using Partitioned Compute introduces yet another overhead and that overhead is much greater than any benefit gained from parallelism in this case.

So I wondered: what if the delay in the query is increased from 2 second to 100 seconds then? Does this increase in the delay mean that parallelism results in faster overall performance?

Here’s what Recent Runs showed for a version of my Dataflow with a 100 second delay for each row and which didn’t use Partitioned Compute:

10 rows * 100 seconds = 1000 seconds = 16 minutes 40 seconds, so it’s not surprising that the overall duration of this version of the Dataflow was slow at 17 minutes 29 seconds.

Here’s what Recent Runs shows for the version of this Dataflow that did use Partitioned Compute:

The overall duration was 4 minutes 41 seconds and the main query took 3 minutes 14 seconds. The important takeaway is that this is a lot faster than the version that didn’t use Partitioned Compute, so clearly Partitioned Compute made a big difference to performance here. As you might expect, it looks like parallelising operations that only take a few seconds results in worse performance while parallelising operations that take longer, say a minute or more, is probably a good idea. As always, you’ll need to test to see what benefits you get for your Dataflows.

These results raise a lot of questions too. 100 seconds = 1 minute 40 seconds, which is a lot less than 3 minutes 14 seconds. Does this mean that not every row in the table was evaluated in parallel? Is partitioning on the RowNumber column counter-productive and would it be better to partition in some other way to try to reduce the amount of attempted parallelism? Is there something else that is limiting the amount of parallelism? While this version of the Dataflow always performs better than the non-partitioned version, performance did vary a lot between refreshes. While these tests show how useful Partitioned Compute can be for slow Dataflows, there’s a lot more research to do and a lot more blog posts to write.

A Closer Look At Preview-Only Steps In Fabric Dataflows

I have been spending a lot of time recently investigating the new performance-related features that have rolled out in Fabric Dataflows over the last few months, so expect a lot of blog posts on this subject in the near future. Probably my favourite of these features is Preview-Only steps: they make such a big difference to my quality of life as a Dataflows developer.

The basic idea (which you can read about in the very detailed docs here) is that you can add steps to a query inside a Dataflow that are only executed when you are editing the query and looking at data in the preview pane; when the Dataflow is refreshed these steps are ignored. This means you can do things like add filters, remove columns or summarise data while you’re editing the Dataflow in order to make the performance of the editor faster or debug data problems. It’s all very straightforward and works well.

The more I thought about this feature, though, the more I wondered about how it actually works and how it can be used in more complex queries that involve hand-written M code – so I decided to do a few tests. First of all, consider the following M query:

			
let
  Step1 = 1,
  Step2 = Step1 - 1
in
  Step2

		

This query returns a numeric value: 0.

At this point it will return the same value in the editor when viewing the preview of the output and when the Dataflow refreshes. But if you right-click on Step2 and select “Enable only in previews” then the query still returns 0 in the preview but will return 1 when the Dataflow refreshes. You will also see this message displayed in the preview pane:

			
This data preview uses preview-only steps. 
Results may differ when running the dataflow.

This makes sense because the query follows a linear pattern: the output references Step2 which in turn references Step1 so if you disable Step2 then the the output simply skips it and returns the value of Step1.

But what if your M code is more complex? For example consider this query:

			
let
  x = 5,
  y = 8,
  xtimesy = x*y,
  outputtable = #table(type table[xy=number],{{xtimesy}})
in
  outputtable

		

Here’s what this returns in the editor: a table that contains the value 40.

What if you disable the step called “y”?

Here’s what I saw in my Warehouse when I loaded the output of the query to it:

I had no idea what the output would be before I saw it but, thinking about it, I assume what has happened is that since the step “y” has been disabled, “y” simply returns the value of the previous step in the query, “x”, and therefore the output becomes 5*5=25. This understanding of how preview-only steps work is backed-up by the fact that it is not possible to set step “x” to be preview-only. The option to do so is greyed out:

This all leads on to a pattern I used in a Dataflow this week and which I can see myself using a lot more in the future. While being able to disable steps when the Dataflow refreshes is very useful, sometimes what you need is to write some conditional logic in your M code that does one thing if you’re in the editor and another thing if the Dataflow is refreshing. Here’s a slightly modified version of the first query above which I have called IsRefresh in my Dataflow:

			
let
  RefreshValue = 1,
  PreviewValue = RefreshValue-1
in
  PreviewValue

		

With the PreviewValue step set to be preview-only:

…then what we have here is a query that returns 1 when the Dataflow is refreshing and 0 when you’re viewing its output in the editor. Here’s an example of how it can be used in a query:

			
let
  Source = #table(
    type table [
      NumericValue = number,
      TextValue    = text
    ],
    {
      {
        IsRefresh,
        if IsRefresh = 1 then
          "This is a refresh"
        else
          "This is a preview"
      }
    }
  )
in
  Source

		

This query returns the following when you preview the output in the editor:

But when the Dataflow refreshes, the output in the destination is this:

Generating Sample Data With Copilot In Fabric Dataflows Gen2

As a Power Query fan I’m naturally interested in Copilot in Dataflows Gen2. But since Power Query is already very easy to use and since I’m pretty experienced at using it, can Copilot really do anything to help me be more productive? Well one thing I need to do a lot for blog posts, demos and testing is to generate tables of sample data. Typing data into the Enter Data source can be quite tedious and writing the M code to generate sample data isn’t straightforward, so I was wondering if Copilot in Dataflows Gen2 could do it for me. It turns out it can do so really well.

After a little bit of experimentation I’ve come up with an example prompt that can be easily modified to return a table of random values with columns of different data types and a given number of rows, and which works fairly reliably:

Return a table consisting of the following five columns: 
Country, containing one of 10 random country names; 
Product, containing one of 15 random fruit names; 
Person, containing one of 20 random first names that are common in the United Kingdom;
Date, containing random dates between January 1st 2019 and December 31st 2024;
Sales, containing random integers between 0 and 500. 
The table should have 1000 rows. Set an appropriate data type for each column.

Here’s an example of the M code generated:

let
    // Generate lists of random values
    Countries = {"United States", "Canada", "United Kingdom", "Germany", "France", "Italy", "Spain", "Australia", "India", "China"},
    Products = {"Apple", "Banana", "Cherry", "Date", "Elderberry", "Fig", "Grape", "Honeydew", "Kiwi", "Lemon", "Mango", "Nectarine", "Orange", "Papaya", "Quince"},
    People = {"James", "Mary", "John", "Patricia", "Robert", "Jennifer", "Michael", "Linda", "William", "Elizabeth", "David", "Barbara", "Richard", "Susan", "Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen"},
    StartDate = #date(2019, 1, 1),
    EndDate = #date(2024, 12, 31),
    DaysBetween = Duration.Days(EndDate - StartDate) + 1,

    // Function to generate random dates
    RandomDate = () => Date.From(StartDate + #duration(Number.RoundDown(Number.RandomBetween(0, DaysBetween)), 0, 0, 0)),

    // Generate table records
    Records = List.Transform({1..1000}, each [
        Country = Countries{Number.Mod(_, List.Count(Countries))},
        Product = Products{Number.Mod(_, List.Count(Products))},
        Person = People{Number.Mod(_, List.Count(People))},
        Date = RandomDate(),
        Sales = Number.RoundDown(Number.RandomBetween(0, 500))
    ]),

    // Convert list of records to table and set data types
    Source = Table.FromRecords(Records),
    TypedTable = Table.TransformColumnTypes(Source, {
        {"Country", type text},
        {"Product", type text},
        {"Person", type text},
        {"Date", type date},
        {"Sales", Int32.Type}
    })
in
    TypedTable

And here’s an example of the output:

Definitely a time saver as far as I’m concerned. Is it totally reliable? No: it occasionally produces code that errors or which doesn’t contain genuinely random values, but it’s good enough and it’s faster to try the prompt once or twice than write the code myself. I know there are other, more sophisticated ways of generating sample data like this in Fabric, for example in Python, but as I said I’m a Power Query person.

And of course, for bonus points, we can now send the output of a Dataflow Gen2 to a CSV file in SharePoint which makes this even more useful:

Two New Power BI/Power Query Books

It’s time for another one of my occasional posts about free books that I’ve been given that you might be interested to check out. The usual disclaimers apply: these aren’t impartial, detailed reviews and the links contain an Amazon UK affiliate code so I get a kickback if you buy a copy.

Power Query Beyond The User Interface, by Chandeep Chhabra

Chandeep is of course famous on YouTube from the Goodly Power BI channel; I’ve met him at conferences and he’s just as friendly and passionate in real life. That passion shows through in this book. It’s aimed at people who already know Power Query and want to learn M and while it does a great job of that, I think anyone who writes M on a regular basis would also benefit from reading it. It’s packed with practical examples, well-written, everything is clearly explained and it covers more recent additions to the language that older books might not talk about. It’s focused on the M language and doesn’t cover topics like performance tuning but I think that focus is a good thing. Highly recommended for anyone serious about Power Query.

Architecting Power BI Solutions In Microsoft Fabric, by Nagaraj Venkatesan

It looks like the Packt marketing machine has gone into overdrive for this title because I see both Greg Lowe and Sandeep Pawar have already published their reviews, and I agree with their sentiments. Power BI (and even more so Fabric) is complicated and so there’s a huge demand for guidance around what all of the components do and how to put them together to create a solution. The team I work on at Microsoft, the Fabric CAT team, has a published guidance documentation here and other people have written books, blog posts and white papers addressing the same problem. This book is certainly a very useful addition to the existing literature. It covers newer topics like Power BI Copilot and some topics that are rarely if ever mentioned elsewhere, such as Power BI’s integration with Purview. As the other reviewers have mentioned, books like this always suffer from changes to the product making them out of date very quickly but that’s unavoidable. Also, being written by a Microsoft employee (and this is something I can relate to), it’s not very opinionated and doesn’t tell you which features of the product are good and which ones should be avoided. All in all, pretty good though.

Reading Delta Metadata In Power Query

There’s a new M function rolling out now that allows you to read metadata from Delta tables (at the time of writing it’s available in Dataflows Gen2 and will be available soon in Desktop). It builds on the DeltaLake.Table M function that allows you to read data from Delta tables and is similar to the Parquet.Metadata function that was released last year. Here’s an example of how to use it to get metadata from a Delta table in OneLake:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/insertworkspaceidhere/insertlakehouseidhere/Tables/inserttablenamehere", 
    [HierarchicalNavigation = true]
  ), 
  ToDelta = DeltaLake.Metadata(
    DeltaLake.Table(Source)
  )
in
  ToDelta

The function returns a table of records containing the metadata from the Delta table such as the schema, how the table is partitioned, and whether the table is V-Ordered or not:

Improve Power Query Performance On CSV Files Containing Date Columns

A few weeks ago I replied to a question on reddit where someone was experiencing extremely slow performance when importing data from a CSV file using Power Query. The original poster worked out the cause of the problem and the solution themselves: they saw that removing all date columns from their query made their Power Query query much faster and that using the Date.FromText function and specifying the date format solved the problem. While I couldn’t reproduce the extreme slowness that was reported I was able to reproduce a performance difference between the two approaches and Curt Hagenlocher of the Power Query team confirmed that this was expected behaviour.

Let’s see an example. I created a CSV file with five date columns and one million rows, then created a Power Query query to import this data into Power BI Desktop using the default M code generated by the Power Query Editor:

let
  Source = Csv.Document(
    File.Contents("C:\GenerateDates.csv"),
    [
      Delimiter  = ",",
      Columns    = 5,
      Encoding   = 65001,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    Source,
    [PromoteAllScalars = true]
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"Extra Spaces", type date},
      {"Extra Spaces - 2", type date},
      {"Extra Spaces - 3", type date},
      {"Extra Spaces - 4", type date},
      {"Extra Spaces - 5", type date}
    }
  )
in
  #"Changed Type"

The dates in the CSV file were in the following format:

02  Jan   1901

…and this is important: there are two spaces between the day and the month name and three spaces between the month name and the year.

Using SQL Server Profiler I found that this query took around 14 seconds to run.

I then created a second query that, instead of using Table.TransformColumnTypes to set the data type on the columns, used Date.FromText and the Format option:

let
  Source = Csv.Document(
    File.Contents("C:\GenerateDates.csv"),
    [
      Delimiter  = ",",
      Columns    = 5,
      Encoding   = 65001,
      QuoteStyle = QuoteStyle.None
    ]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(
    Source,
    [PromoteAllScalars = true]
  ),
  DateConversionFunction = (inputDate) as date =>
    Date.FromText(
      inputDate,
      [Format = "dd  MMM   yyyy"]
    ),
  ChangeDate = Table.TransformColumns(
    #"Promoted Headers",
    {
      {
        "Extra Spaces",
        DateConversionFunction,
        Date.Type
      },
      {
        "Extra Spaces - 2",
        DateConversionFunction,
        Date.Type
      },
      {
        "Extra Spaces - 3",
        DateConversionFunction,
        Date.Type
      },
      {
        "Extra Spaces - 4",
        DateConversionFunction,
        Date.Type
      },
      {
        "Extra Spaces - 5",
        DateConversionFunction,
        Date.Type
      }
    }
  )
in
  ChangeDate

This version of the query took around 10.5 seconds to run, so not a huge improvement but a noticeable one. It’s certainly not the 6/7x performance improvement seen on the reddit post but I’m sure different data, different date formats and different hardware might result in bigger differences.

I was told by Curt that when Power Query uses Table.TransformColumnTypes to parse date data from CSV files it tries a series of different date formats in order: first it tries ISO-8601 (for example 9th February 2025 would be “2025-02-09”), then a long date format, then a short date format, and finally it uses a generic .NET date parsing function which is slower than the others. It does this to make sure date parsing “just works” as often as possible. The dates in the example above, with the extra spaces, were deliberately designed to be slow for Table.TransformColumnTypes. When I tested on CSV files that contained dates in IS-8601 format I found that Table.TransformColumnTypes performed the same as Date.FromText.

So, to sum up, if you’re using CSV files containing date columns as a source for Power Query and you’re experiencing performance problems, try changing your M code to use Date.FromText instead of Table.TransformColumnTypes to set the data types on the date columns.

Module.Versions Function In Power Query

The ever-vigilant folks on the internet have spotted that there’s a new M function in the latest versions of Power BI and Excel: Module.Versions. This function, at the time of writing, returns a record with a single field in that contains the version number of the Power Query engine currently in use. So for example if I have a Power Query query in Power BI Desktop that consists of the following code:

Module.Versions()

It returns the following:

…where 2.129.181.0 is the version of the Power Query engine in my build of Power BI Desktop.

This function was introduced for people developing Power Query custom connectors who only want to enable certain functionality if the user is running a given version of the Power Query engine or above. I guess if you’re sharing your own M custom functions on the internet then you might want to do the same thing.

[Thanks to Curt Hagenlocher for giving me the inside information here]

New Semi Join, Anti Join And Query Folding Functionality In Power Query

There are a couple of nice new features to do with table joins (or merges as they are known in M) and query folding in Power Query in the April release of Power BI Desktop that I want to highlight.

Anti Joins now fold

First of all, a few months ago I wrote a post about how the built-in anti join functionality didn’t fold in Power Query. The good news is that it now does on SQL Server-related sources, so no more workarounds are needed. For example, if you have two tables in a SQL Server database called Fruit1 and Fruit2 and two Power Query queries that get data from those tables:

…then the following M code:

let
  Source = Table.Join(
    Fruit1,
    {"Fruit1"},
    Fruit2,
    {"Fruit2"},
    JoinKind.LeftAnti
  )
in
  Source

…returns the following table of fruits that are in the Fruit1 table and not in the Fruit2 table:

Of course that’s what the code above returned in previous versions of Power Query too. The difference now is that query folding occurs and the following SQL code is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where not exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null

New join kind: semi joins

There are also two brand new join kind you can use in the Table.Join and Table.NestedJoin functions: JoinKind.LeftSemi and JoinKind.RightSemi. Semi joins allow you to select the rows in one table that have matching values in another table. Using the Fruit1 and Fruit2 tables above, the following M code:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  )
in
  Source

Returns all the rows in Fruit1 where there is a matching value in Fruit2:

Here’s the SQL that is generated:

select [$Outer].[Fruit1],
    cast(null as nvarchar(50)) as [Fruit2]
from 
(
    select [_].[Fruit] as [Fruit1]
    from [dbo].[Fruit1] as [_]
) as [$Outer]
where exists 
(
    select 1
    from 
    (
        select [_].[Fruit] as [Fruit2]
        from [dbo].[Fruit2] as [_]
    ) as [$Inner]
    where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null

The ?? operator now folds

The M language’s ?? coalesce operator is used for replacing null values and this now folds on SQL Server-related sources too now. For example, the M query in the previous section that did a semi join on Fruit1 and Fruit2 returns a table where all the rows in the Fruit2 colum contain null values. The following M query adds a new custom column that returns the text value “Nothing” when the Fruit2 column contains a null:

let
  Source = Table.Join(
    Fruit1, 
    {"Fruit1"}, 
    Fruit2, 
    {"Fruit2"}, 
    JoinKind.LeftSemi
  ), 
  ReplaceNulls = Table.AddColumn(
    Source, 
    "NullReplacement", 
    each [Fruit2] ?? "Nothing"
  )
in
  ReplaceNulls

Here’s the SQL generated for this, where the ?? operator is folded to a CASE statement:

select [_].[Fruit1] as [Fruit1],
    [_].[Fruit2] as [Fruit2],
    case
        when [_].[Fruit2] is null
        then 'Nothing'
        else [_].[Fruit2]
    end as [NullReplacement]
from 
(
    select [$Outer].[Fruit1],
        cast(null as nvarchar(50)) as [Fruit2]
    from 
    (
        select [_].[Fruit] as [Fruit1]
        from [dbo].[Fruit1] as [_]
    ) as [$Outer]
    where exists 
    (
        select 1
        from 
        (
            select [_].[Fruit] as [Fruit2]
            from [dbo].[Fruit2] as [_]
        ) as [$Inner]
        where [$Outer].[Fruit1] = [$Inner].[Fruit2] or [$Outer].[Fruit1] is null and [$Inner].[Fruit2] is null
    )
) as [_]

[Thanks to Curt Hagenlocher for the information in this post]

Category: M

Generating Sample Data In Fabric Dataflows With FabricAI.Prompt()

Like this:

Generating Excel Reports With Fabric Dataflows Gen2

Like this:

When Can Partitioned Compute Help Improve Fabric Dataflow Performance?

Like this:

A Closer Look At Preview-Only Steps In Fabric Dataflows

Like this:

Generating Sample Data With Copilot In Fabric Dataflows Gen2

Like this:

Two New Power BI/Power Query Books

Like this:

Reading Delta Metadata In Power Query

Like this:

Improve Power Query Performance On CSV Files Containing Date Columns

Like this:

Module.Versions Function In Power Query

Like this:

New Semi Join, Anti Join And Query Folding Functionality In Power Query

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this:

Share this:

Like this: