Why You Should Defragment Your Fact Tables If You’re Using Incremental Refresh In Power BI Premium

A lot of people are building large datasets in Power BI Premium nowadays, and if you’re building a large dataset you’re very likely to be using Power BI’s incremental refresh functionality with your fact tables or manually creating and refreshing partitions in them. If so, you should occasionally do a special kind of refresh to shrink the dictionaries on your fact tables.

To illustrate why you need to do this I created a Power BI dataset with one table containing one column, where each value was a GUID (ie a unique text value). Here’s the M I used to generate it:

let
    Source = {1..10000000},
    Output = #table(type table [MyGUID=text], 
        List.Transform(
            Source, 
            each {Text.NewGuid()})
    )
in
    Output

After publishing the dataset I copied the original partition so there were two partitions in the table, each of which contained ten million rows. When you use incremental refresh on a table Power BI automatically generates and manages partitions for you in the background – and there isn’t much difference between doing this and creating partitions manually like I’m doing here.

Here’s what DAX Studio’s Model Metrics functionality showed when I first refreshed the dataset:

The Dictionary column is the column to focus on in this table. Dictionaries are the data structures created by Power BI’s storage engine, the Vertipaq engine, to hold all the distinct values in a column – the more distinct values there are, the larger the dictionary. As Marco points out here, the DMVs that DAX Studio uses to measure dictionary size don’t always return consistent or accurate results, but the values they return are good enough for the purposes of this test.

I then did several Full refreshes on just one of the two partitions. Here’s what DAX Studio showed at this point:

As you can see, the size of the dictionary for the MyGUID column (and therefore the overall size of the dataset) has grown a lot – by about 1GB – even though the cardinality of the column has stayed the same. This is because dictionaries are held at the table level, not at the partition level, so by refreshing just one partition I was adding 10 million new text values to the dictionary without deleting any of the old values that had been added to the dictionary by previous refreshes and which were no longer present in the table.

Finally, I ran the following Refresh command against the workspace’s XMLA Endpoint from SQL Server Management Studio:

{
  "refresh": {
    "type": "defragment",
    "objects": [
      {
        "database": "TableWithManyDistinctValues",
        "table": "MyTable"
      }
    ]
  }
}

[Note that XMLA Endpoints are only available if you’re using Premium or PPU]

This refresh is of type defragment and it is documented here. This is the explanation of what it does from the docs:

Defragment the data in the specified table. As data is added to or removed from a table, the dictionaries of each column can become polluted with values that no longer exist in the actual column values. The defragment option will clean up the values in the dictionaries that are no longer used.

After that you can see the size of the dictionary, and therefore the size of the dataset, was back to around its original size:

This is an extreme example of course, but I can imagine something nearly this bad happening in the real world if you use incremental refresh on a table with a text column containing addresses or user comments and if you don’t keep much history. This isn’t new behaviour or functionality: it happens in Analysis Services Tabular models too and is discussed in books like Marco and Alberto’s “Tabular Modelling in Microsoft SQL Server Analysis Services”, but not many people are aware it still applies to Power BI datasets. This type of refresh can be quite slow, but it is likely to be faster than the alternative – running a full refresh on the dataset – which is also the only option if you’re not using Premium.

So why should you run a refresh of type defragment on your dataset? There are two reasons:

  1. If you’re getting close to the size limits imposed by the Premium capacity SKU you’re using, and you want to make sure that you don’t hit them.
  2. Large datasets generally result in slower query performance, so reducing the size of your dataset may result in faster queries.

How often should you do this? In most cases not very often; it depends on how much your dataset shrinks when you defragment it. Once a year may be sufficient and I doubt anyone would need to do it more than once a month. One way to automate this would be to use Power Automate to call the Enhanced Refresh API, as I showed in this series of posts.
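
If you do automate it that way, defragment is also one of the values accepted by the Enhanced Refresh API’s type parameter, so the request body sent to the Refresh Dataset In Group endpoint can be as simple as the following sketch (the table name here is just the one from my example above):

{
  "type": "Defragment",
  "objects": [
    {
      "table": "MyTable"
    }
  ]
}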

Understanding The “Unable to save the changes since the new dataset of size” Error In Power BI

If you’re refreshing a Power BI dataset that contains a large amount of data you may encounter the following error:

Data source error Unable to save the changes since the new dataset of size 1120 MB exceeds the limit of 1024 MB. If the dataset is hosted on a PowerBI Premium capacity, you may also reach out to your capacity administrator to see if the limit can be adjusted.

You’ll get this error at the end of a refresh if the size of your dataset is larger than the size allowed by Power BI. The maximum size of a dataset varies depending on whether you are using Shared capacity (commonly known as “Pro”) or Premium capacity:

  • In Shared/Pro the maximum size of a dataset is 1GB (ie the 1024MB from the error message shown above), as documented here.
  • In Premium:
    • The default maximum size of a dataset is 10GB if the “Large dataset storage format” option (documented here and here) is turned off for your dataset.
    • If the “Large dataset storage format” option is turned on, the maximum size of your dataset depends on the amount of memory available in the SKU of the Power BI Premium capacity you’re using. There’s a table in the documentation that tells you how much memory is available for each SKU: look in the “RAM per dataset” column and also note what it says in the footnotes about the maximum size of a dataset being slightly less than the amount of memory listed.

What can you do if you run into this error? You have two choices:

  1. If you’re using a Power BI Pro licence, buy Premium Per User licences or buy Premium capacity and move your dataset into a workspace on PPU or Premium. If you’re already on Premium, upgrade your capacity and make sure that the “Large dataset storage format” option is turned on for your dataset. This will increase the dataset size limits.
  2. Reduce the size of your dataset so it fits within the limits.

Option #2 is likely to be the preferred option in most cases. There are a lot of detailed blog posts and articles out there on how to reduce the size of your dataset (including this one in the docs) but basically you should use DAX Studio’s Model Metrics feature to see what’s taking up the most space in your dataset, then delete unnecessary tables and columns, try to reduce the number of distinct values in each column, and reduce the number of rows you’re loading into each table (perhaps by reducing the amount of historic data). Quite often a few small changes can significantly shrink the size of your dataset and also help refresh times and query performance.

Speed Up Power BI Dataset Refresh Performance In Premium Or PPU By Changing The “Parallel Loading Of Tables” Setting

Do you have a large dataset in Power BI Premium or Premium Per User? Do you have more than six tables that take a significant amount of time to refresh? If so, you may be able to speed up the performance of your dataset’s refresh by increasing the number of tables that are refreshed in parallel, using a feature that was released in August 2022 but which you may have missed.

Some of you may be thinking: haven’t you blogged about refresh parallelism before? Yes: last year I wrote a post about setting the maxParallelism property when refreshing through TMSL scripts, and a few months ago I showed how you could change the same property when refreshing using the Enhanced Refresh API. That was all fairly complex though and what I’m going to show you here is, in contrast, very easy to implement.

To illustrate this I created an Import dataset containing nine tables, each of which loaded data from different CSV files. Each table contained a couple of million rows and took 20-30 seconds to refresh. I then published to a PPU workspace in the Power BI Service and used the technique described in Phil Seamark’s “Visualise your Power BI refresh” blog post to capture what happened during refresh. The refresh took 44 seconds and, as you can see from this screenshot of Phil’s report, only six of the tables were refreshed in parallel at any one time:

[Ignore the y axis on this graph – I have no idea what’s going on with it]

This is as you would expect: the default number of objects that can be refreshed in parallel in Premium or Premium Per User is 6. This is also the maximum number of objects that can be refreshed in parallel in Shared capacity.

I then went to my original .pbix file, opened the Options dialog, went to the Current File/Data Load pane and changed the new “Parallel loading of tables” setting on my dataset from Default to Custom and entered the value 9:

This setting sets the maxParallelism property in the refresh commands generated by Power BI in the Power BI Service; it is fully documented here. Rerunning the refresh in the Power BI Service I found that the overall time taken was down to 29 seconds and that all the tables were refreshed in parallel:

Here’s the Refresh command from the Profiler trace showing maxParallelism set to 9:
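
Roughly speaking, it’s a TMSL refresh command shaped something like this sketch – the database name is a placeholder, and the real command generated by the Power BI Service contains a few more properties:

{
  "refresh": {
    "type": "full",
    "maxParallelism": 9,
    "objects": [
      {
        "database": "MyDataset"
      }
    ]
  }
}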

In summary: refresh performance got about a third faster by making this simple change, so if you’re using Power BI Premium and not already doing advanced things with TMSL scripts or the Enhanced Refresh API, this is definitely something you should experiment with. As I said last year you can try values up to 30 for this setting, but there’s no guarantee you’ll get the amount of parallelism you ask for (it depends on the Premium capacity SKU and a few other factors). What’s more, setting this property too high could result in slower refresh because it might result in Power BI overloading your data source with queries.

[Thanks to Akshai Mirchandani for answering my questions about this subject]

Calling The Power BI Enhanced Refresh API From Power Automate, Part 6: Cancelling Dataset Refreshes

The thing that got me excited about the Power BI Enhanced Refresh API, and which inspired me to start this series of posts, was the fact that for the first time it gives you a way of cancelling Power BI dataset refreshes. In this last post in the series I’ll show you how you can add an Action to your Power Automate custom connector to cancel a refresh and how you can use it in a Power Automate Flow.

Adding a new Action is straightforward and very similar to what I’ve shown in previous posts in this series. As before, click on the New action button on the Definition stage of the custom connector wizard, fill in the General section with appropriate names:

…then use the Import from sample option to populate the Request section using the DELETE verb and the following URL for the Cancel Refresh In Group API endpoint:

https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes/{refreshId}

The refreshId, which uniquely identifies the refresh operation you wish to cancel, can be obtained from the Action I described in part 4 of this series that kicks off a refresh.

Using this new Action in a Flow is more interesting. Here’s an example Flow that kicks off a dataset refresh but cancels it if it goes on for more than 120 seconds; it’s similar to, but more complex than, the Flow I described in this post that checks the status of a refresh. At the top level it looks like this:

The first two actions kick off a full refresh of a dataset:

The next two Actions initialise two variables:

  • ElapsedRefreshDurationSeconds is an integer used to hold the amount of time in seconds that the refresh has been running for each time it is checked
  • StopCheckingRefreshStatus is a boolean variable used to determine whether the upcoming Do until loop should be terminated or not

The contents of the Do until loop look like this:

After waiting 30 seconds the status of the refresh is checked. If the status is “Unknown” then the refresh is ongoing and the next step is to check how long it has been running for; otherwise the refresh has completed and the StopCheckingRefreshStatus variable can be set to true so the Do until loop can be terminated. Drilling into the “Is refresh ongoing” conditional Action:

If the refresh is still in progress then the current elapsed duration in seconds of the refresh is calculated using the following expression (this blog post showed me how to calculate durations in Power Automate):

div(sub(ticks(utcNow()), ticks(outputs('Get_status_of_one_dataset_refresh')?['body/startTime'])),10000000)

If that elapsed duration is more than 120 seconds then the refresh should be cancelled; otherwise no action is taken. Drilling into the Yes branch of the conditional Action:

The refresh is cancelled by calling the Action created at the beginning of this post; after the cancellation the StopCheckingRefreshStatus variable is set to true. Here are details of the Actions from the Yes branch that do this:

That’s it for this post and this series. Since I started the series over a month ago a few other articles on similar subjects have been published on LinkedIn that you might want to check out, by my colleagues Romain Casteres and Rui Romano; as more enterprise BI solutions are built on Power BI, I know there will be a lot more people implementing complex refresh scenarios using Power Automate/Logic Apps and the Enhanced Refresh API.

Calling The Power BI Enhanced Refresh API From Power Automate, Part 5: Getting A Dataset’s Refresh History

In the last post in this series I showed how to get the status of an individual Power BI dataset refresh. In this post I’ll show how to get the history of a dataset’s refreshes and save it to a file in OneDrive for Business – useful if you want to do some analysis on refresh performance or the causes of failures.

This is one of the most straightforward things to do with the Power BI Enhanced Refresh API, especially once you’ve already got a custom connector created in the way I’ve shown in the previous posts in this series. You just need to add a new Action to your custom connector, fill in the General section:

…then in the Request section use Import from sample with the GET verb on the following URL (for Get Refresh History In Group):

https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes?$top={$top}

The only thing to do after this is edit the optional $top query parameter (which allows you to limit the number of refreshes whose history you return) so it takes an integer rather than a string:

Update the custom connector and you can now use it in a simple Flow to save the JSON returned by the call to the API into a JSON file:
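
For reference, the JSON returned by Get Refresh History In Group is an object with a value array containing one entry per refresh. Here’s a trimmed, illustrative sketch – the values shown are placeholders and the exact set of fields can vary (failed refreshes, for example, also include error details):

{
  "value": [
    {
      "requestId": "00000000-0000-0000-0000-000000000000",
      "refreshType": "ViaEnhancedApi",
      "startTime": "2022-07-01T10:00:00Z",
      "endTime": "2022-07-01T10:05:30Z",
      "status": "Completed"
    }
  ]
}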

You can then connect to this data very easily via Power Query, for example:

Rather than trigger a Flow like this manually, in the real world you’d want to run it on a schedule, perhaps every day. You’d then end up with a folder full of JSON files you could analyse and as much history stored as you wanted. It should be possible to load and dedupe this data (multiple JSON files will contain overlapping refresh history) using Power BI dataflows and/or datamarts but that’s something I’ll leave for another blog post…

Calling The Power BI Enhanced Refresh API From Power Automate, Part 4: Getting The Status Of a Refresh

So far in this series (see part 1, part 2 and part 3) I’ve looked at how you can create a Power Automate custom connector that uses the Power BI Enhanced Refresh API to kick off a dataset refresh. That’s only half the story though: once the refresh has been started you need to know if it has finished and, if so, whether it finished successfully or not. In this post I’ll show how to do this.

When you start a refresh using the Enhanced Refresh API Power BI returns a unique identifier for that refresh operation and you will need to modify your Power Automate custom connector to make it easy to capture and use that identifier. You can do this on the Definition stage of the custom connector wizard for an Action that kicks off a refresh (ie any of the Actions I built in the previous posts in this series) by going to the Response section, clicking on the Add default response button and pasting the following sample response (from here in the docs) into the Headers box and clicking Import:

x-ms-request-id: 87f31ef7-1e3a-4006-9b0b-191693e79e9e
Location: https://api.powerbi.com/v1.0/myorg/groups/f089354e-8366-4e18-aea3-4cb4a3a50b48/datasets/cfafbeb1-8037-4d0c-896e-a46fb27ff229/refreshes/87f31ef7-1e3a-4006-9b0b-191693e79e9e

It’s the x-ms-request-id custom header that will contain the unique identifier for the refresh.

The next step is to create a new Action to check the status of the refresh using the Get Refresh Execution Details In Group API endpoint. To do this, click on the New action button on the Definition stage of the custom connector wizard and fill in the details in the General section:

Then fill in the Request section. Using Import from sample, select the verb GET and enter the following URL:

https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes/{refreshId}

Then in the Response section click Add default response and paste in the (rather long) sample JSON response from the docs found here. The response should look like this:

You can now use this new Action in a Flow. Here’s a simple example:

The first two Actions here, “Manually trigger a flow” and “Incremental Refresh”, kick off a refresh in the way I’ve shown in the previous posts in this series. The “Do until” Action is where the interesting stuff happens:

What this does is:

  • First of all, the Delay Action waits for 30 seconds. There’s no point checking the status of a refresh immediately after it has been started, and in this case 30 seconds is a reasonable amount of time to wait. Depending on how long your refresh takes you may want to use a longer delay.
  • Then call the new Action created above to check the status of the refresh that has just been started, using the x-ms-request-id value returned by the Incremental Refresh Action.
  • If the status (returned by the Get status of one dataset refresh Action) is not Unknown then the refresh has completed and the Do until loop can be terminated. If the status is Unknown then the refresh has either not started or is still in progress so the loop should be run again, ie it will wait for another 30 seconds and then check the status again.

Finally, once the refresh has completed the Condition Action sends an email telling me the outcome: if the status is Completed then the refresh has succeeded; if it is not then the status tells you at a high-level what went wrong (you can look at the extendedStatus and the contents of the objects JSON array for more details).
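
To give you an idea of what the Flow has to work with, here’s a heavily trimmed, illustrative sketch of the kind of JSON returned by Get Refresh Execution Details In Group – it’s based on the documented sample and the values shown here are placeholders:

{
  "startTime": "2022-07-01T10:00:00Z",
  "endTime": "2022-07-01T10:05:30Z",
  "type": "Full",
  "commitMode": "transactional",
  "status": "Completed",
  "extendedStatus": "Completed",
  "objects": [
    {
      "table": "FactInternetSales",
      "partition": "FactInternetSales2004",
      "status": "Completed"
    }
  ]
}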

Calling The Power BI Enhanced Refresh API From Power Automate, Part 3: Incremental Refresh Options

If you’ve read part 1 or part 2 of this series you’ll know how you can create a Power Automate custom connector to call the Power BI Enhanced Refresh API and get fine-grained control over your refreshes. In this post I will take a look at the two parameters in the Enhanced Refresh API for datasets that use incremental refresh: applyRefreshPolicy and effectiveDate. They are documented here but, as always, some worked examples are helpful to really understand how they work.

Before I carry on, I’ll assume you have created a custom connector for Power Automate that supports these two parameters. The easiest way to do this is to use the following JSON for the body when using “Import from sample” to create a new Action:

{
    "type": "Full",
    "commitMode": "transactional",
    "applyRefreshPolicy": true,
    "effectiveDate": "12/31/2013"
}

The Action should look like this in the Test stage of the custom connector wizard:

Let’s start with effectiveDate. By default, incremental refresh allows you to do things like “refresh only the last month of data” with the unstated assumption that “last month” means “last month relative to today’s date”. If the data in a table is loaded on a monthly, quarterly or yearly basis then this default behaviour of incremental refresh may lead to unexpected performance problems or data loss. The effectiveDate parameter allows you to specify a date to use instead of today’s date as the starting point for incremental refresh.

For example, I have a dataset connected to the old Adventure Works DW SQL Server sample database. If I configure incremental refresh on the FactInternetSales table as follows:

…and refresh in the Power BI Service, I can see in SQL Server Management Studio that the following partitions have been created to hold the data in the table:

As you can see I have yearly partitions created relative to today’s date. Unfortunately my old copy of the Adventure Works DW database only has data for the years 2001 to 2004, which means that after the refresh no data is loaded into the dataset. However, if I refresh from Power Automate using my custom connector and I set effectiveDate to 31st December 2004 like so:

…then the following partitions get created in the FactInternetSales table and all my data is loaded into them:
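
For reference, the request body my custom connector sends in this case is equivalent to the following – the same shape as the sample body above, with the effectiveDate used in this example:

{
    "type": "Full",
    "commitMode": "transactional",
    "applyRefreshPolicy": true,
    "effectiveDate": "12/31/2004"
}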

Now let’s consider applyRefreshPolicy. If you have set up incremental refresh on a table then when you do a normal scheduled refresh of your dataset only some of the data is reloaded, saving you a lot of time. However there will be occasions where you want to override this behaviour and force a full refresh of a table even though incremental refresh has been configured: for example some values in your historic data may have been updated, so the only safe way to ensure the correct data is in Power BI is to reload the entire table. I’ve seen people schedule a full refresh of their data every week or every month just in case there were any updates that they weren’t informed of. Setting applyRefreshPolicy to false (the default is true) allows you to do this.

For example, the FactInternetSales table in the dataset described above is configured to refresh only the last two years of data after the initial full refresh, which means that with an effectiveDate of 31st December 2004 only the 2004 and 2003 partitions are refreshed. However, if you set applyRefreshPolicy to false (in Power Automate false for a boolean parameter will appear as a “No”), like so:

…then all partitions in the FactInternetSales table will get refreshed.
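
Again for reference, the equivalent request body with applyRefreshPolicy turned off looks like this:

{
    "type": "Full",
    "commitMode": "transactional",
    "applyRefreshPolicy": false,
    "effectiveDate": "12/31/2004"
}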

It’s important to mention that the applyRefreshPolicy parameter only works with certain settings for the type parameter, as described here, and certain settings for the commitMode parameter, as described here.

Of course you can use these two parameters in TMSL refresh commands without using the Enhanced Refresh API or Power Automate, but I think they are particularly useful when used from a Power Automate custom connector because of the extra refresh scheduling flexibility you get with Power Automate: you may want to run a normal refresh every day but a full refresh with applyRefreshPolicy set to false once a week, for example, or you may want to only refresh one fact table out of many every quarter and set effectiveDate when you do that.

Calling The Power BI Enhanced Refresh API From Power Automate, Part 2: Refreshing Specific Tables And Partitions

In part 1 of this series I showed how you could create a very simple custom connector for Power Automate that allows you to call the Power BI Enhanced Refresh API. However, I didn’t show what I think is the main reason you’d want to build a custom connector: the ability to refresh specific tables or partitions in a dataset rather than the entire dataset. The main reason you’d want to do this is to speed up refresh performance, and there are two reasons why it can do so:

  1. Some tables in your dataset may not need refreshing every time because the data in them hasn’t changed, so not refreshing them saves you time and CPU.
  2. Refreshing the partitions in a table and increasing the maxParallelism property can also significantly speed up refresh performance, as I showed here.

This is a very common requirement and there are a lot of blog posts and videos out there on the subject but I think using a Power Automate custom connector is probably the easiest way of doing this, although it only works for datasets in a Premium or PPU capacity.

To implement an Action that does this in an existing connector, click on the New Action button and follow the instructions from my last post. Fill in the fields in the General box:

Then click on the Import from sample button, select the POST verb, enter the URL

	https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes

…leave the Headers box blank and then, in the Body box, enter the sample Body from here:

{
    "type": "Full",
    "commitMode": "transactional",
    "maxParallelism": 2,
    "retryCount": 2,
    "objects": [
        {
            "table": "DimCustomer",
            "partition": "DimCustomer"
        },
        {
            "table": "DimDate"
        }
    ]
}

[Note that you can’t have two Actions with the same URL in the same custom connector]

There’s some extra work to do here though. Click on the dropdown on the body parameter in the Request box and click Edit:

Next, click Edit under “table” in the Parameter section:

…and edit the title and description to reflect the fact that this parameter will contain the list of tables and partitions you want to be refreshed:

Once you’ve saved the connector you can test it – but one thing that caught me out is that the Test step in the custom connector designer doesn’t actually handle the tables and partitions parameter properly (something mentioned in passing here) so you’ll want to test this in a real flow. Here’s an example that refreshes the DimDate table and two partitions (FactInternetSales2001 and FactInternetSales2003) of the FactInternetSales table:

Notice that you can add as many tables or partitions to be refreshed as you want by clicking the Add new item button; if you want to refresh a table just enter the table name and leave the partition box empty, but if you want to refresh a partition you need to fill in both the table name and the partition name.
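
For reference, a flow like the one above ends up sending a request body along these lines – a sketch based on the sample body earlier in this post, using the table and partition names from my example:

{
    "type": "Full",
    "commitMode": "transactional",
    "maxParallelism": 2,
    "retryCount": 2,
    "objects": [
        {
            "table": "DimDate"
        },
        {
            "table": "FactInternetSales",
            "partition": "FactInternetSales2001"
        },
        {
            "table": "FactInternetSales",
            "partition": "FactInternetSales2003"
        }
    ]
}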

And that’s it, at least for now. Next time I’ll take a look at some of the options for datasets that use incremental refresh.

Calling The Power BI Enhanced Refresh API From Power Automate, Part 1: Creating A Basic Custom Connector

I love the new Power BI Enhanced Refresh API: it allows you to do things like refresh individual tables in your dataset, override incremental refresh policies, control the amount of parallelism, cancel refreshes and a lot more, while being easier to use than the XMLA Endpoint. However, like the XMLA Endpoint, one problem remains: how can you schedule a dataset refresh using it? One option is to create a custom connector for Power Automate (similar to what I described here for the Export API, before the Power BI export actions for Power Automate had been released): this not only allows you to schedule more complex refreshes but also gives you more flexibility over scheduling and lets you do things like send emails if refreshes fail.

There’s no point going into the details of creating a custom connector for a Power BI API endpoint because it’s been done before, most notably by Konstantinos Ioannou who has a very detailed walkthrough here which I strongly suggest you read. There’s only one thing that has changed since he wrote that post: the Power BI App Registration Tool is now here. You also need to give the app you create the “Read and write all datasets” permission:

When you get to the Definition stage of creating the connector there are some choices to make. The Enhanced Refresh API has a lot of functionality and it could be very complicated to build a custom connector that supports everything – especially if you or your users don’t need all that functionality, or if a lot of options could confuse your users. As a result it could be better to only expose a subset of the functionality – and that’s what I’ll do in this first post.

Let’s take a few basic options to start off with: the refresh type (ie do you want to do a full refresh, clear the data out of the dataset etc?), the commit mode (do you want everything you’re refreshing to be refreshed in a single transaction?), the maximum amount of parallelism and the number of retries if refresh fails. Click on the New Action button and fill in the details in the General section:

Then, in the Request section, click on Import from sample and select the verb POST, enter the following URL:
https://api.powerbi.com/v1.0/myorg/groups/{groupId}/datasets/{datasetId}/refreshes

…leave the Headers box empty and then enter the following in the Body box:

{
    "type": "Full",
    "commitMode": "transactional",
    "maxParallelism": 2,
    "retryCount": 2
}

This will create an Action that allows you to set the type, commitMode, maxParallelism and retryCount options. The Request section of the Definition step should look like this:

You can finish creating the connector as per Konstantinos’s instructions after that.

Finally, to test this connector in a flow, you can build a new instant cloud flow that looks like this:

[You can find the workspace ID (the groupId for the API) and the dataset ID by going to the dataset’s Settings page in the Power BI Service and getting them from the URL as detailed here]

You will also need to turn off the Asynchronous Pattern option in the Settings dialog of the action:

You should now have a flow which can kick off a dataset refresh with a few options. This is only the beginning though: there are more options that can be added, and this flow only starts a refresh – it doesn’t tell you whether the refresh succeeded or not, or allow you to cancel a refresh, or anything else fun like that. We’ll investigate all of these things and more in future posts in this series.

Custom Queries For “Detect Data Changes” In Power BI Incremental Refresh

One feature of Power BI incremental refresh I’ve always been meaning to test out is the ability to create your own M queries to work with the “detect data changes” feature, and last week I finally had the chance to do it. The documentation is reasonably detailed but I thought it would be a good idea to show a worked example of how to use it to get direct control over what data is refreshed during an incremental refresh.

First of all I created a simple dataset with incremental refresh enabled. The source was a SQL Server table with two columns: Date (actually a datetime column) and Sales.

I then configured incremental refresh as follows:

In the background this created six yearly partitions:

Nothing interesting here so far, but the real challenge lies ahead: how exactly do you use custom queries with “detect data changes”?

I created a new table in my SQL Server database called DetectDataChangesTable with one row for every partition in the dataset (even though the incremental refresh configuration above means only the 2021 and 2022 partitions will ever be refreshed) and the values for the RangeStart and RangeEnd M parameters that would be set when each partition is refreshed:

I then created an M query in my dataset called DetectDataChangesQuery that connected to this table, filtered the RangeStart column by the current value of the RangeStart M parameter and the RangeEnd column by the current value of the RangeEnd M parameter, and then returned just the Output column:

let
  Source = Sql.Databases(
    "ThisIsMySQLServerName"
  ),
  IncrementalRefreshDemo = Source
    {[Name = "IncrementalRefreshDemo"]}
    [Data],
  dbo_DetectDataChangesTable
    = IncrementalRefreshDemo
    {
      [
        Schema = "dbo",
        Item = "DetectDataChangesTable"
      ]
    }
    [Data],
  FilterByParams = Table.SelectRows(
    dbo_DetectDataChangesTable,
    each [RangeStart]
      = RangeStart and [RangeEnd]
      = RangeEnd
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    FilterByParams,
    {"Output"}
  )
in
  #"Removed Other Columns"

Here’s the output of the query in the Power Query Editor with the RangeStart M parameter set to 1/1/2021 and the RangeEnd M parameter set to 1/1/2022:

The important thing to point out here is that while the documentation says the query must return a scalar value, in fact the query needs to return a table with one column and one row containing a single scalar value.

After publishing the dataset once again, the next thing to do was to set the pollingExpression property described in the documentation. I did this by connecting to the dataset via the XMLA Endpoint using Tabular Editor 3, then clicking on the Sales table and looking in the Refresh Policy section in the Properties pane. I set the property to the name of the query I just created, DetectDataChangesQuery:

I then forced a full refresh of the Sales table, including all partitions, by running a TMSL script in SQL Server Management Studio and setting the applyRefreshPolicy parameter to false, as documented here. Here’s the TMSL script:

{
  "refresh": {
    "type": "full",
	"applyRefreshPolicy": false,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

Scripting the entire table out to TMSL I could then see that the refreshBookmark property on the two partitions (2021 and 2022) that could be refreshed in an incremental refresh was set to 1, the value returned for those partitions in the Output column of the DetectDataChangesQuery query:
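
As a very rough sketch, the relevant part of one of the scripted partition definitions looks something like this – the exact set of properties and their formats may differ slightly in your model, and the dates shown are just the boundaries of the 2021 partition:

{
  "name": "2021",
  "mode": "import",
  "source": {
    "type": "policyRange",
    "start": "2021-01-01T00:00:00",
    "end": "2022-01-01T00:00:00"
  },
  "refreshBookmark": "1"
}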

The refreshBookmark property is important because it stores the value that Power BI compares with the output of the DetectDataChangesQuery query on subsequent dataset refreshes to determine if the partition needs to be refreshed. So, in this case, the value of refreshBookmark is 1 for the 2021 partition, but if in a future refresh the DetectDataChangesQuery returns a different value for this partition then Power BI knows it needs to be refreshed.

I then went back to the DetectDataChangesTable table in SQL and set the Output column to be 2 for the row relating to the 2021 partition:

Next, I went back to SQL Server Management Studio and refreshed the table using a TMSL script with applyRefreshPolicy set to true (which is the default, and what would happen if you refreshed the dataset through the Power BI portal):

{
  "refresh": {
    "type": "full",
	"applyRefreshPolicy": true,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

In the Messages pane of the query window I saw that Power BI had detected the value returned by DetectDataChangesQuery for the 2021 partition had changed, and that therefore the partition needed to be refreshed:

Lower down in the Messages pane the output confirmed that only the 2021 partition was being refreshed:

In Profiler I saw three SQL queries. The first two were to query the DetectDataChangesTable table for the two partitions that might be refreshed to check to see if the value returned in the Output column was different:

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2022-01-01 00:00:00') 
and [_].[RangeStart] is not null) 
and ([_].[RangeEnd] = convert(datetime2, '2023-01-01 00:00:00') 
and [_].[RangeEnd] is not null)
select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2021-01-01 00:00:00') 
and [_].[RangeStart] is not null) 
and ([_].[RangeEnd] = convert(datetime2, '2022-01-01 00:00:00') 
and [_].[RangeEnd] is not null)

The third was to get the data for the 2021 partition, which was the only partition that needed to be refreshed:

select [_].[Date],
    [_].[Sales]
from [dbo].[Sales] as [_]
where [_].[Date] >= convert(datetime2, '2021-01-01 00:00:00') 
and [_].[Date] < convert(datetime2, '2022-01-01 00:00:00')

Finally, scripting the Sales table again to TMSL after the refresh had completed showed that the refreshBookmark property had changed to 2 for the 2021 partition:

And that’s it. I really like this feature but I’ve never seen anyone use it in the real world, which is a shame. Maybe this blog post will inspire someone out there to try it in production?

[UPDATE] An extra point to add is that if you use this functionality, the data source for the partitions must be the same as the data source used for the polling queries, otherwise you’ll get an error.