Generating Sample Data With Copilot In Fabric Dataflows Gen2

As a Power Query fan I’m naturally interested in Copilot in Dataflows Gen2. But since Power Query is already very easy to use, and since I’m pretty experienced at using it, can Copilot really do anything to help me be more productive? Well, one thing I need to do a lot for blog posts, demos and testing is to generate tables of sample data. Typing data into the Enter Data source can be quite tedious and writing the M code to generate sample data isn’t straightforward, so I was wondering if Copilot in Dataflows Gen2 could do it for me. It turns out it can, and really well.

After a little bit of experimentation I’ve come up with an example prompt that can be easily modified to return a table of random values with columns of different data types and a given number of rows, and which works fairly reliably:

Return a table consisting of the following five columns: 
Country, containing one of 10 random country names; 
Product, containing one of 15 random fruit names; 
Person, containing one of 20 random first names that are common in the United Kingdom;
Date, containing random dates between January 1st 2019 and December 31st 2024;
Sales, containing random integers between 0 and 500. 
The table should have 1000 rows. Set an appropriate data type for each column.

Here’s an example of the M code generated:

let
    // Generate lists of random values
    Countries = {"United States", "Canada", "United Kingdom", "Germany", "France", "Italy", "Spain", "Australia", "India", "China"},
    Products = {"Apple", "Banana", "Cherry", "Date", "Elderberry", "Fig", "Grape", "Honeydew", "Kiwi", "Lemon", "Mango", "Nectarine", "Orange", "Papaya", "Quince"},
    People = {"James", "Mary", "John", "Patricia", "Robert", "Jennifer", "Michael", "Linda", "William", "Elizabeth", "David", "Barbara", "Richard", "Susan", "Joseph", "Jessica", "Thomas", "Sarah", "Charles", "Karen"},
    StartDate = #date(2019, 1, 1),
    EndDate = #date(2024, 12, 31),
    DaysBetween = Duration.Days(EndDate - StartDate) + 1,

    // Function to generate random dates
    RandomDate = () => Date.From(StartDate + #duration(Number.RoundDown(Number.RandomBetween(0, DaysBetween)), 0, 0, 0)),

    // Generate table records
    Records = List.Transform({1..1000}, each [
        Country = Countries{Number.Mod(_, List.Count(Countries))},
        Product = Products{Number.Mod(_, List.Count(Products))},
        Person = People{Number.Mod(_, List.Count(People))},
        Date = RandomDate(),
        Sales = Number.RoundDown(Number.RandomBetween(0, 500))
    ]),

    // Convert list of records to table and set data types
    Source = Table.FromRecords(Records),
    TypedTable = Table.TransformColumnTypes(Source, {
        {"Country", type text},
        {"Product", type text},
        {"Person", type text},
        {"Date", type date},
        {"Sales", Int32.Type}
    })
in
    TypedTable

And here’s an example of the output:

Definitely a time saver as far as I’m concerned. Is it totally reliable? No: it occasionally produces code that errors or that doesn’t contain genuinely random values – indeed, the code above picks countries, products and names cyclically using Number.Mod rather than at random – but it’s good enough, and it’s faster to try the prompt once or twice than to write the code myself. I know there are other, more sophisticated ways of generating sample data like this in Fabric, for example in Python, but as I said I’m a Power Query person.
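For comparison, here’s a minimal Python sketch of the same idea – genuinely random rather than cyclical picks – using only the standard library. The value lists are shortened, assumed stand-ins for the ones in the M code above, not anything Copilot generated:

```python
import random
import datetime

# Assumed value lists - shortened versions of those in the M code above
countries = ["United States", "Canada", "United Kingdom", "Germany", "France",
             "Italy", "Spain", "Australia", "India", "China"]
products = ["Apple", "Banana", "Cherry", "Fig", "Grape", "Kiwi", "Lemon",
            "Mango", "Orange", "Papaya"]
people = ["James", "Mary", "John", "Patricia", "Robert", "Jennifer",
          "Michael", "Linda", "William", "Elizabeth"]

start = datetime.date(2019, 1, 1)
end = datetime.date(2024, 12, 31)
max_offset = (end - start).days  # random.randint is inclusive at both ends

# One dict per row; random.choice picks independently for each row
rows = [
    {
        "Country": random.choice(countries),
        "Product": random.choice(products),
        "Person": random.choice(people),
        "Date": start + datetime.timedelta(days=random.randint(0, max_offset)),
        "Sales": random.randint(0, 500),
    }
    for _ in range(1000)
]
```

From there the rows could be written out with the csv module or loaded into a pandas DataFrame, but that’s beyond what the prompt above asks for.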

And of course, for bonus points, we can now send the output of a Dataflow Gen2 to a CSV file in SharePoint which makes this even more useful:

Performance Implications Of Different Ways Of Fully Expanding A Power BI Matrix Visual

If you have a Power BI report with a matrix visual on it, it’s quite likely that you’ll want all the levels in the matrix to be fully expanded by default. But did you know that the way you expand all the levels could have performance implications, especially if you’re using DirectQuery mode? Here’s an example.

I have a DirectQuery semantic model built on top of some of the tables from the SQL Server AdventureWorksDW sample database (apologies for the poor naming):

There are four DAX measures defined on it:

Sales Amount = SUM(FactInternetSales[SalesAmount])

Monday Sales = CALCULATE([Sales Amount], 'DimDate'[EnglishDayNameOfWeek]="Monday")

January Sales = CALCULATE([Sales Amount], 'DimDate'[EnglishMonthName]="January")

Class H Sales = CALCULATE([Sales Amount], 'DimProduct'[Class]="H")

I wrote these measures specifically to exacerbate the problem I’m going to show (by reducing the amount of fusion that is possible) but they are pretty normal, reasonable measures that you might find in any semantic model.

Now let’s say you add a matrix visual to a report page, put these four measures onto the columns axis of the matrix, and drop the CalendarYear column (from the DimDate table), the Color column and the Style column (both from the DimProduct table) onto the rows axis of the matrix. At this point it looks like this:

…but what you want to do now is show all the styles and colours too.

One way to do it – not the most efficient way, but some people like me just love to click – is to expand every year and style individually:

It doesn’t take too long to expand everything and after all you only need to do it once, right? But let’s take the DAX query generated for this visual and paste it into DAX Studio with Server Timings turned on and see what we can see:

There are 14 separate Storage Engine queries – which result in 14 separate SQL queries being sent to SQL Server. The first two Storage Engine/SQL queries get a list of which years and styles have been drilled down on and then there are (4 measures) * (3 levels of granularity) = 12 other Storage Engine queries to get the data shown in the visual. The overall duration of 230ms here is very low but in the real world the SQL queries could be a lot slower, making the DAX query very slow.

The default limits on the number of SQL queries that a DAX query can run in parallel have a big impact on overall performance here, as you can see. Even though you can increase those limits, you may then hit the maximum number of connections that can be opened to a DirectQuery source; and even though you can increase that limit too if you’re running on a capacity, there are hard limits here. If Power BI needs to open new connections to the data source in order to run these SQL queries, that can slow things down too, because there can sometimes be a noticeable wait when connections are opened. Reducing the number of Storage Engine queries generated by a DAX query is very important when tuning DirectQuery models; the effect is going to be a lot less noticeable on an Import or Direct Lake semantic model, but it could still cause problems.

There’s good news though. If you expand the levels in your matrix in a different (and to be honest, much more convenient) way using the “Expand all down one level in the hierarchy” button on the visual header or the “Expand to next level” option on the right-click menu for the rows like so:

…then you get the same result but with a much more efficient DAX query. Here’s what Server Timings shows for the DAX query generated for the fully expanded matrix now:

This time there are only four Storage Engine/SQL queries, one for each measure, and the overall duration is just 50ms. Even though, as you can see from the screenshot, only three Storage Engine/SQL queries can run in parallel and the fourth has to wait for the first query to finish before it can run, that’s less of an issue given the smaller number of queries. I won’t bother showing the DAX for the two versions of the matrix but it’s clear when you look at them that the second one is more efficient because it knows it can expand everything on rows rather than just what has been clicked. Of course, this type of optimisation is only possible if you are fully expanding your matrix.

Finding Events Linked To A Specific Power BI Visual In Fabric Workspace Monitoring

Over the last few years one topic I have blogged about several times is how to link the detailed data about DAX query execution that can be found in Log Analytics – and now Workspace Monitoring – to the visual in a Power BI report that generated those DAX queries, something that is extremely useful when you’re performance tuning Power BI reports. My first post here from 2021 showed how to write the KQL for Log Analytics, but the problem has always been how to get the IDs of the visuals in a Power BI report. You can get the IDs from the definition of the Power BI report, as shown here, and Sandeep Pawar has a great post on some other methods here, but all these methods were superseded in the March release of Power BI with the ability to copy the IDs by right-clicking on the visual in a Power BI report when editing it (thank you Rui Romano!).

This made me realise that it’s time to revisit my first post on how to get the query details in KQL since the column names in Workspace Monitoring are slightly different from Log Analytics and, indeed, the KQL needed can be simplified from my original version. Here’s a KQL query that you can run in a KQL Queryset connected to your Monitoring Eventhouse:

let VisualId = "InsertVisualIdHere";
SemanticModelLogs
| search VisualId
| project Timestamp, OperationName, OperationDetailName, DurationMs, EventText, OperationId 
| order by Timestamp asc

Once you’ve copied the visual’s ID from the report (remember you need to specifically enable this feature and that you need to be in Edit mode) by right-clicking on it and selecting “Copy object name”:

…then you just need to paste the ID into the let statement in the first line of the KQL query and run it:

You can find the documentation for the columns in the SemanticModelLogs table here and the documentation for the events here. The events for a specific query all have the same value in the OperationId column.

Bonus fact: you can now run queries against Workspace Monitoring using Semantic Link Labs, as documented here, which makes it much easier to do other fun stuff with this data. For example, I can imagine there are ways to visualise DAX query and semantic model refresh events in Python that would make them much easier to analyse, but that’s something for a future blog post.

Documenting Power BI Semantic Models With Fabric Data Agents

AI is meant to help us automate boring tasks, and what could be more boring than creating documentation for your Power BI semantic models? It’s such a tedious task that most people don’t bother; there’s an ecosystem of third party tools that do this job for you, and you can also build your own solution for this using DAX DMVs or the new-ish INFO functions (see here for a good example). That got me wondering: can you use Fabric Data Agents to generate documentation for you? And what’s more, why even generate documentation when you can just ask a Data Agent the questions that you’d need to generate documentation to answer?

To test this idea out, I dug up a semantic model I’ve used for blog posts in the past with a simple star schema and a few measures:

Here are the definitions of the measures:

Units = SUM(Sales[SalesUnits])
Value = SUM(Sales[SalesValue])
Apples Value = 
CALCULATE([Value], KEEPFILTERS('Product'[ProductName]="Apples"), USERELATIONSHIP('Product'[ProductID], 'Sales'[ProductID]))
Apples Value Forecast = [Apples Value] * 1.1 

After publishing this model I created a Data Agent directly connected to it and found I couldn’t access the metadata I needed from it (I didn’t try too hard, but I don’t think it was possible), so I then created a second Import-mode semantic model, used the first as a data source and created a single table bound to the following DAX query using the INFO.CALCDEPENDENCY function:

EVALUATE INFO.CALCDEPENDENCY()

This second semantic model therefore contained a list of all the DAX calculations, their definitions and the dependencies between them, from the first semantic model. Here’s that table:

I then published this second semantic model, built a Data Agent on it, and used it for all my tests. I also added some instructions to the Data Agent – I freely admit I have no idea whatsoever about prompt engineering but after a bit of trial and error the following instructions did the trick:

You are a Power BI semantic model developer.
The CalcDependencies table contains the output of the DAX INFO.CALCDEPENDENCY() function for another Power BI semantic model. It contains information about the dependencies between the columns, tables, measures, calculated columns, calculated tables and relationships in that model. 
Each row in the CalcDependencies table represents a dependency between two objects. 
The combination of values in the OBJECT_TYPE, TABLE and OBJECT columns uniquely identifies the object which is the source of the dependency.
The combination of values in the REFERENCED_OBJECT_TYPE, REFERENCED_TABLE and REFERENCED_OBJECT columns uniquely identifies the object which is the target of the dependency.
You can join the table to itself multiple times to find chains of dependencies between objects.
When I ask about dependencies, please include direct dependencies and also indirect dependencies, for example where one object has a dependency on another object which in turn has a dependency on the object I am asking about.
The EXPRESSION column contains the DAX definition of an object.
A value of CALC_COLUMN in the OBJECT_TYPE column indicates that the object is a calculated column.

First test: create some documentation. The following prompt:

Generate some documentation for the measures listed in the CalcDependencies table. Describe what each calculation does and how it works.

…generated the following output:

Here is the documentation for the measures listed in the CalcDependencies table, describing what each calculation does and how it works:

Measures

1. Units

  • Definition: SUM(Sales[SalesUnits])
  • Description: The Units measure calculates the total number of sales units. It does this by summing the values in the SalesUnits column in the Sales table. This measure is useful for tracking the total quantity of items sold.

2. Value

  • Definition: SUM(Sales[SalesValue])
  • Description: The Value measure calculates the total sales value. It achieves this by summing up the values in the SalesValue column in the Sales table. This measure is essential for understanding the total revenue generated from sales.

3. Apples Value

  • Definition: CALCULATE([Value], KEEPFILTERS('Product'[ProductName]="Apples"), USERELATIONSHIP('Product'[ProductID], 'Sales'[ProductID]))
  • Description: The Apples Value measure calculates the total sales value specifically for the product “Apples.” It works by applying a filter to include only rows where the ProductName in the Product table is “Apples.” Additionally, it uses a specific relationship between the Product and Sales tables defined by the ProductID columns. This measure helps in analyzing the revenue generated from the sale of apples.

4. Apples Value Forecast

  • Definition: [Apples Value] * 1.1
  • Description: The Apples Value Forecast measure estimates the future sales value for the product “Apples” based on a forecasted increase. It multiplies the current Apples Value by 1.1 (indicating a 10% increase). This measure is useful for predicting future revenue from apple sales.

These descriptions provide a clear understanding of what each measure does and how it works, helping anyone who looks at the documentation to comprehend the purpose and functionality of each calculation.

The verdict? More than good enough – though, to be honest, I was expecting the Data Agent to do a good job here.

Second test: understand the dependencies between calculations. This is an interesting challenge because you can have chained relationships (i.e. measures that depend on other measures that depend on other measures, etc) which makes the DAX harder to write. Also, you can’t easily understand dependencies just by looking at the output of INFO.CALCDEPENDENCY – most of the third party tools mentioned above solve this by visualising the data with some kind of network diagram.
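To illustrate what answering this kind of question involves, here’s a minimal Python sketch that walks chained dependencies with a breadth-first search. The edge list is a made-up miniature of the INFO.CALCDEPENDENCY output for the model above, not real metadata pulled from it:

```python
from collections import deque

# Hypothetical miniature of INFO.CALCDEPENDENCY output: (object, referenced_object)
dependencies = [
    ("Value", "Sales[SalesValue]"),
    ("Apples Value", "Value"),
    ("Apples Value Forecast", "Apples Value"),
    ("Units", "Sales[SalesUnits]"),
]

def affected_by(target):
    """Return every object that directly or indirectly depends on target."""
    affected, queue = set(), deque([target])
    while queue:
        current = queue.popleft()
        for obj, ref in dependencies:
            if ref == current and obj not in affected:
                affected.add(obj)
                queue.append(obj)
    return affected

# Which measures break if the SalesValue column is deleted?
print(affected_by("Sales[SalesValue]"))
```

The result matches the Data Agent’s answer below: Value directly, then Apples Value and Apples Value Forecast through the chain.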

The following prompt:

Which measures would be affected if I deleted the SalesValue column on the Sales table?

…generated the following output:

If you delete the SalesValue column from the Sales table, the following measures will be affected:

  1. Measure Name: Value
    • Definition: SUM(Sales[SalesValue])

Since the Value measure directly depends on the SalesValue column, deleting the SalesValue column will cause the Value measure to break. Additionally, any measures that depend on the Value measure, such as Apples Value and Apples Value Forecast, will also be affected indirectly:

  1. Measure Name: Apples Value
    • Definition: CALCULATE([Value], KEEPFILTERS('Product'[ProductName]="Apples"), USERELATIONSHIP('Product'[ProductID], 'Sales'[ProductID]))
  2. Measure Name: Apples Value Forecast
    • Definition: [Apples Value] * 1.1

Therefore, deleting the SalesValue column will break the Value, Apples Value, and Apples Value Forecast measures.

The verdict? I’m impressed, this is useful.

Final test: can we generate a visualisation of the dependencies? A few months ago one of my colleagues showed me the Mermaid language for defining diagrams like this and the rather cool Markdown & Mermaid custom visual that allows you to display Mermaid diagrams in Power BI. The following prompt:

Generate an expression in the Mermaid diagramming language that shows all the measures in the model and the dependencies between them

…generated the following response:

Here is an expression in the Mermaid diagramming language that shows all the measures in the model and the dependencies between them:

graph TD
  Units --> Units
  Value --> Value
  Apples_Value --> Value
  Apples_Value_Forecast --> Apples_Value

  Units["Units"]
  Value["Value"]
  Apples_Value["Apples Value"]
  Apples_Value_Forecast["Apples Value Forecast"]

Explanation:

  • Units is a measure with no dependencies on other measures.
  • Value is a base measure.
  • Apples Value depends on the Value measure.
  • Apples Value Forecast depends on the Apples Value measure, which in turn depends on the Value measure.

This Mermaid diagram visually represents all the measures in the model and the dependencies between them using arrows.

Pasting the expression in the response into the Markdown & Mermaid custom visual gave me the following:

I wasn’t particularly happy with those messy lines linking the Value and Units measures to themselves, but overall it does the job pretty well.

This blog post is already long enough and while these are simple tests, I think I’ve proved my point that Fabric Data Agents can be very useful when it comes to understanding the structure of a semantic model and the calculations in it. Bringing in data from other DAX INFO functions and tuning the AI Instructions would allow you to answer even more complex questions. I can also imagine a solution using Semantic Link Labs that not only dumps the data from DAX INFO functions for all your semantic models into a table in OneLake for consumption by a Data Agent, but also then uses the Data Agent SDK to automatically generate markdown documentation for those models, complete with diagrams. The possibilities are endless!