Incremental Refresh On Delta Tables In Power BI

One of the coolest features in Fabric is Direct Lake mode, which allows you to build Power BI reports directly on top of Delta tables in your data lake without having to wait for a semantic model to refresh. However, not everyone is ready for Fabric yet, so there’s also a lot of interest in the new DeltaLake.Table M function, which allows Power Query (in semantic models or dataflows) to read data from Delta tables. If you currently have a serving layer – for example Synapse Serverless or a Databricks SQL Warehouse – between your existing lakehouse and your import mode Power BI semantic models, then this new function could allow you to remove that layer, reducing complexity and cutting costs. This will only be a good idea, though, if refresh performance isn’t impacted and incremental refresh can be made to work well.

So is it possible to get good performance from DeltaLake.Table with incremental refresh? Query folding isn’t possible with this connector because there’s no database to query: a Delta table is just a folder with some files in it. But query folding isn’t necessary for incremental refresh to work well: what’s important is that when Power Query filters a table on the datetime column used for incremental refresh, that query is significantly faster than reading all the data from the table. And, as far as I can see from the testing I’ve done, because of certain performance optimisations within DeltaLake.Table it should be possible to use incremental refresh on a Delta table successfully.

There are three factors that influence the performance of Power Query when querying a Delta table:

  1. The internal structure of the Delta table, in particular whether it is partitioned or not
  2. The implementation of the connector, ie the DeltaLake.Table function
  3. The M code you write in the queries used to populate the tables in your semantic model

There’s not much you can do about #2 – performance is, I think, good enough right now although there are a lot of optimisations that will hopefully come in the future – but #1 and #3 are definitely within your control as a developer and making the right choices makes all the difference.

Here’s what I did to test incremental refresh performance. First, I used a Fabric pipeline to load the NYC Taxi sample data into a table in a Lakehouse (for the purposes of this exercise a Fabric Lakehouse will behave the same as ADLSgen2 storage – I used a Lakehouse because it was easier). Then, in Power BI Desktop, I created an import mode semantic model pointing to the NYC taxi data table in the Lakehouse and configured incremental refresh. Here’s the M code for that table:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/unpartitionednyc/",
    [HierarchicalNavigation = true]
  ),
  ToDelta = DeltaLake.Table(Source),
  #"Filtered Rows" = Table.SelectRows(
    ToDelta,
    each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
  )
in
  #"Filtered Rows"

Here’s the incremental refresh dialog:

I then published the semantic model and refreshed it via the Enhanced Refresh API from a notebook (Semantic Link makes this so much easier) using an effective date of 8th December 2013 to get a good spread of data. I used Phil Seamark’s new, notebook-based version of his refresh visualisation tool to see how long each partition took during an initial refresh:

The refresh took just over 30 minutes.

Next, using Spark SQL, I created a copy of the NYC taxi data table in my Lakehouse with a new datetime column added that contains only the date part of the pickup datetime (the time is truncated to midnight), and I then partitioned the table by that new column (called PickupDate here):

CREATE TABLE PartitionedByDateNYC
USING delta
PARTITIONED BY (PickupDate)
AS
SELECT *, date_trunc("Day", lpepPickupDateTime) as PickupDate
FROM NYCIncrementalRefreshTest.nyctaxi_raw

I created a copy of my semantic model, pointed it to the new table and reconfigured the incremental refresh to filter on the newly-created PickupDate column:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionedbydatenyc/",
    [HierarchicalNavigation = true]
  ),
  ToDelta = DeltaLake.Table(Source),
  #"Filtered Rows" = Table.SelectRows(
    ToDelta,
    each [PickupDate] >= RangeStart and [PickupDate] < RangeEnd
  )
in
  #"Filtered Rows"

…and refreshed again. This time the refresh took about 26 seconds.

Going from half an hour to 26 seconds is a big improvement, and it happens because the DeltaLake.Table function is able to perform partition elimination: each partition in the semantic model aligns with one or more partitions in the Delta table, so when a semantic model partition is refreshed Power Query only needs to read data from the Delta table partitions that contain the relevant data. This only works because the filter in the Power Query query that uses the RangeStart and RangeEnd parameters is on the same column that the Delta table is partitioned by.

In my final test I partitioned my Delta table by month, like so:

CREATE TABLE PartitionedNYC
USING delta
PARTITIONED BY (PickupYearMonth)
AS
SELECT *, (100*date_part('YEAR', lpepPickupDateTime)) + date_part('Months', lpepPickupDateTime) as PickupYearMonth
FROM NYCIncrementalRefreshTest.nyctaxi_raw

The challenge here is that:

  1. The new PickupYearMonth column is an integer column, not a datetime column, so it can’t be used for an incremental refresh filter in Power Query
  2. Power BI incremental refresh creates partitions at the year, quarter, month and date granularities, so filtering by month can’t be used for date partitions

I solved this problem in my Power Query query by calculating the month from the RangeStart and RangeEnd parameters, filtering the table by the PickupYearMonth column (to get partition elimination), stopping any further folding using the Table.StopFolding function and then finally filtering on the same datetime column I used in my first test:

let
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
    [HierarchicalNavigation = true]
  ),
  ToDelta = DeltaLake.Table(Source),
  YearMonthRangeStart = (Date.Year(RangeStart) * 100) + Date.Month(RangeStart),
  YearMonthRangeEnd = (Date.Year(RangeEnd) * 100) + Date.Month(RangeEnd),
  FilterByPartition = Table.StopFolding(
    Table.SelectRows(
      ToDelta,
      each [PickupYearMonth] >= YearMonthRangeStart and [PickupYearMonth] <= YearMonthRangeEnd
    )
  ),
  #"Filtered Rows" = Table.SelectRows(
    FilterByPartition,
    each [lpepPickupDatetime] >= RangeStart and [lpepPickupDatetime] < RangeEnd
  )
in
  #"Filtered Rows"

Interestingly this table refreshed even faster: it took only 18 seconds.

This might just be luck, or it could be because the larger partitions resulted in fewer calls back to the storage layer. The AzureStorage.DataLake M function requests data 4MB at a time by default and this could result in more efficient data retrieval for the data volumes used in this test. I didn’t get round to testing if using non-default options on AzureStorage.DataLake improved performance even more (see here for more details on earlier testing I did with them).
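
For reference, here’s a rough sketch of what passing non-default options might look like. Treat it as an assumption to verify against the current connector documentation rather than a tested recommendation: I’m assuming here that the BlockSize option (in bytes) is what controls the 4MB default mentioned above, and the 8MB value is purely illustrative.

let
  // Assumption: BlockSize is specified in bytes and overrides the 4MB default;
  // test different values against your own data before relying on this
  Source = AzureStorage.DataLake(
    "https://onelake.dfs.fabric.microsoft.com/workspaceid/lakehouseid/Tables/partitionednyc/",
    [HierarchicalNavigation = true, BlockSize = 8 * 1024 * 1024]
  ),
  ToDelta = DeltaLake.Table(Source)
in
  ToDelta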

To sum up, based on these tests it looks like incremental refresh can be used effectively in import mode semantic models with Delta tables and the DeltaLake.Table function so long as you partition your Delta table and configure your Power Query queries to filter on the partition column. I would love to hear what results you get if you test this in the real world so please let me know by leaving a comment.

Keep Your Existing Power BI Data And Add New Data To It Using Fabric

One of the most popular posts on my blog in the last few years has been this one:

To be honest I’m slightly ashamed of this fact because, as I say in the post, the solution I describe is a bit of a hack – but at the same time, the post is popular because a lot of people have the problem of needing to add new data to the data that’s already there in their Power BI dataset and there’s no obvious way of doing that. As I also say in that post, the best solution is to stage the data in a relational database or some other store outside Power BI so you have a copy of all the data if you ever need to do a full refresh of your Power BI dataset.

Why revisit this subject? Well, with Fabric it’s now much easier for you as a Power BI developer to build that place to store a full copy of your data outside your Power BI dataset and solve this problem properly. For a start, you now have a choice of where to store your data: either in a Lakehouse or a Warehouse, depending on whether you feel comfortable with using Spark and notebooks or relational databases and SQL to manage your data. What’s more, with Dataflows gen2, when you load data to a destination you now have the option to append new data to existing data as well as to replace it:

If you need more complex logic to make sure you only load new records and not ones that you’ve loaded before, there’s a published pattern for that.
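
I won’t reproduce that pattern here, but the basic idea is easy to sketch in M: look at what has already been loaded to the destination, then only let through source rows that are newer. Here’s a minimal, hedged illustration (all the table and column names are made up, and in a real Dataflow Gen2 the two inline #table values would be replaced by references to your source query and to the destination table):

let
  // Hypothetical stand-ins: in a real dataflow these would be references to
  // your source query and to the table already loaded to the destination
  SourceData = #table(
    type table [Id = Int64.Type, LoadDate = DateTime.Type],
    {{1, #datetime(2023, 9, 1, 0, 0, 0)}, {2, #datetime(2023, 9, 2, 0, 0, 0)}}
  ),
  ExistingDestination = #table(
    type table [Id = Int64.Type, LoadDate = DateTime.Type],
    {{1, #datetime(2023, 9, 1, 0, 0, 0)}}
  ),
  // Find the most recent load date already present in the destination
  LastLoadDate = List.Max(ExistingDestination[LoadDate]),
  // Keep only the source rows that are newer than anything already loaded
  NewRowsOnly = Table.SelectRows(
    SourceData,
    each LastLoadDate = null or [LoadDate] > LastLoadDate
  )
in
  NewRowsOnly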

“But I’m a Power BI developer, not a Fabric developer!” I hear you cry. Perhaps the most important point to make about Fabric is that Power BI is Fabric. If you have Power BI today, you will have Fabric soon if you don’t have the preview already – they are the same thing. One way of thinking about Fabric is that it’s just Power BI with a lot more stuff in it: databases, notebooks, Spark and pipelines as well as reports, datasets and dataflows. There are new skills to learn but solving this problem with the full range of Fabric workloads is a lot less complex than the pure Power BI approach I originally described.

“But won’t this be expensive? Won’t it need a capacity?” you say. It’s true that to do all this you will need to buy a Fabric capacity, but Fabric capacities start at a much lower price point than Power BI Premium capacities: an F2 capacity costs $0.36USD per hour or $262.80USD per month, and OneLake storage costs $0.023USD per GB per month (for more details see this blog post and the docs).

So, with Fabric, there’s no need for complex and hacky workarounds to solve this problem. Just spin up a Fabric capacity, create a Warehouse or Lakehouse to store your data, use Dataflows Gen2 to append new data to any existing data, then build your Power BI dataset on that.

Custom Queries For “Detect Data Changes” In Power BI Incremental Refresh

One feature of Power BI incremental refresh I’ve always been meaning to test out is the ability to create your own M queries to work with the “detect data changes” feature, and last week I finally had the chance to do it. The documentation is reasonably detailed but I thought it would be a good idea to show a worked example of how to use it to get direct control over what data is refreshed during an incremental refresh.

First of all I created a simple dataset with incremental refresh enabled. The source was a SQL Server table with two columns: Date (actually a datetime column) and Sales.

I then configured incremental refresh as follows:

In the background this created six yearly partitions:

Nothing interesting here so far, but the real challenge lies ahead: how exactly do you use custom queries with “detect data changes”?

I created a new table in my SQL Server database called DetectDataChangesTable with one row for every partition in the dataset (even though the incremental refresh configuration above means only the 2021 and 2022 partitions will ever be refreshed), columns holding the values that the RangeStart and RangeEnd M parameters would be set to when each partition is refreshed, and an Output column:

I then created an M query in my dataset called DetectDataChangesQuery that connected to this table, filtered the RangeStart column by the current value of the RangeStart M parameter and the RangeEnd column by the current value of the RangeEnd M parameter, and then returned just the Output column:

let
  Source = Sql.Databases(
    "ThisIsMySQLServerName"
  ),
  IncrementalRefreshDemo = Source
    {[Name = "IncrementalRefreshDemo"]}
    [Data],
  dbo_DetectDataChangesTable
    = IncrementalRefreshDemo
    {
      [
        Schema = "dbo",
        Item = "DetectDataChangesTable"
      ]
    }
    [Data],
  FilterByParams = Table.SelectRows(
    dbo_DetectDataChangesTable,
    each [RangeStart]
      = RangeStart and [RangeEnd]
      = RangeEnd
  ),
  #"Removed Other Columns"
    = Table.SelectColumns(
    FilterByParams,
    {"Output"}
  )
in
  #"Removed Other Columns"

Here’s the output of the query in the Power Query Editor with the RangeStart M parameter set to 1/1/2021 and the RangeEnd M parameter set to 1/1/2022:

The important thing to point out here is that while the documentation says the query must return a scalar value, in fact the query needs to return a table with one column and one row containing a single scalar value.
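
To make that concrete, the simplest possible polling query, one that ignores the RangeStart and RangeEnd parameters completely and always returns the same bookmark value, would look something like this:

let
  // A one-column, one-row table: this is the shape Power BI expects the
  // polling query to return, with the single value stored as the bookmark
  Output = #table(type table [Output = Int64.Type], {{1}})
in
  Output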

After publishing the dataset once again, the next thing to do was to set the pollingExpression property described in the documentation. I did this by connecting to the dataset via the XMLA Endpoint using Tabular Editor 3, then clicking on the Sales table and looking in the Refresh Policy section in the Properties pane. I set the property to the name of the query I just created, DetectDataChangesQuery:

I then forced a full refresh of the Sales table, including all partitions, by running a TMSL script in SQL Server Management Studio and setting the applyRefreshPolicy parameter to false, as documented here. Here’s the TMSL script:

{
  "refresh": {
    "type": "full",
	"applyRefreshPolicy": false,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

Scripting the entire table out to TMSL, I could then see that the refreshBookmark property on the two partitions that could be refreshed in an incremental refresh (2021 and 2022) was set to 1, the value returned for those partitions in the Output column of the DetectDataChangesQuery query:

The refreshBookmark property is important because it stores the value that Power BI compares with the output of the DetectDataChangesQuery query on subsequent dataset refreshes to determine if the partition needs to be refreshed. So, in this case, the value of refreshBookmark is 1 for the 2021 partition, but if in a future refresh the DetectDataChangesQuery query returns a different value for this partition then Power BI knows it needs to be refreshed.

I then went back to the DetectDataChangesTable table in SQL and set the Output column to be 2 for the row relating to the 2021 partition:

Next, I went back to SQL Server Management Studio and refreshed the table using a TMSL script with applyRefreshPolicy set to true (which is the default, and what would happen if you refreshed the dataset through the Power BI portal):

{
  "refresh": {
    "type": "full",
	"applyRefreshPolicy": true,
    "objects": [
      {
        "database": "IncrementalRefreshDetectDataChangesTest",
        "table": "Sales"
      }
    ]
  }
}

In the Messages pane of the query window I saw that Power BI had detected the value returned by DetectDataChangesQuery for the 2021 partition had changed, and that therefore the partition needed to be refreshed:

Lower down in the Messages pane the output confirmed that only the 2021 partition was being refreshed:

In Profiler I saw three SQL queries. The first two queried the DetectDataChangesTable table for the two partitions that might be refreshed, to check whether the value returned in the Output column had changed:

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2022-01-01 00:00:00') 
and [_].[RangeStart] is not null) 
and ([_].[RangeEnd] = convert(datetime2, '2023-01-01 00:00:00') 
and [_].[RangeEnd] is not null)

select [_].[Output]
from [dbo].[DetectDataChangesTable] as [_]
where ([_].[RangeStart] = convert(datetime2, '2021-01-01 00:00:00') 
and [_].[RangeStart] is not null) 
and ([_].[RangeEnd] = convert(datetime2, '2022-01-01 00:00:00') 
and [_].[RangeEnd] is not null)

The third was to get the data for the 2021 partition, which was the only partition that needed to be refreshed:

select [_].[Date],
    [_].[Sales]
from [dbo].[Sales] as [_]
where [_].[Date] >= convert(datetime2, '2021-01-01 00:00:00') 
and [_].[Date] < convert(datetime2, '2022-01-01 00:00:00')

Finally, scripting the Sales table again to TMSL after the refresh had completed showed that the refreshBookmark property had changed to 2 for the 2021 partition:

And that’s it. I really like this feature but I’ve never seen anyone use it in the real world, which is a shame. Maybe this blog post will inspire someone out there to try it in production?

[UPDATE] An extra point to add is that if you use this functionality, the data source for the partitions must be the same as the data source used for the polling queries, otherwise you’ll get an error.

Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh

[Update September 2023: now that Power BI is part of Microsoft Fabric, the new features of Fabric make it much easier to solve this problem as described here]

Power BI incremental refresh is a very powerful feature and now that it’s available in Shared capacity (not just Premium) everyone can use it. It’s designed for scenarios where you have a data warehouse running on a relational database, but with a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post I’m going to show you how to use incremental refresh to solve another very common problem – namely how to get Power BI to keep the data that’s already in your dataset and add new data to it.

I know what you’re thinking at this point: isn’t this what incremental refresh is meant to do anyway? Well, yes it is, but as I said it’s designed to work in scenarios where a relational data warehouse stores a copy of all the data that’s in your dataset and in some cases you don’t have that luxury. For example, you may want to:

  • Connect to a data source that only gives you a set of new sales transactions each day, and add these sales transactions to the ones you have already stored in a Power BI dataset
  • Take a snapshot of a data source, like an Excel workbook, that is changing all the time and store each of these daily snapshots of the contents of the workbook in a Power BI dataset

The current Power BI incremental refresh functionality doesn’t make it easy to do either of these things, and that’s why I’ve written this post.

Let me be clear though: in all these cases I recommend that you don’t use the technique I’m going to show you. If possible, you should stage a copy of each day’s data in a relational database (ie build that data warehouse) or even as text files in a folder (Power Automate may be useful to do this) and use that staged copy as the data source for Power BI. This will allow you to do a full refresh of the data in your dataset at any point in the future if you need to, or create a completely new dataset, even though it means you have to do a lot of extra work. If it isn’t possible to do this, or you’re too lazy or you’re just curious to see how my technique works, read on.

For the example I’m going to show in this post I’m going to use a web-based data source, an RSS feed from the BBC News website that returns a list of the current top stories on the site. You can find it here:

http://feeds.bbci.co.uk/news/rss.xml

RSS is based on XML and there’s no authentication needed to access this feed, so loading a table of these top stories into Power BI is very easy – so easy, I’m not going to bother explaining how to do it. Instead I’m going to show you how to use incremental refresh to store a copy of this list of top stories every day in the same table in a dataset.

First of all, for incremental refresh to work two Power Query parameters of data type date time need to be created called RangeStart and RangeEnd. More details about how to create them can be found in the docs here. Power BI expects them to be used to filter the rows loaded into a table; it also uses them to partition these tables in the dataset.
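
You would normally create these through the Manage Parameters dialog, but under the covers each parameter is just an M expression with some metadata attached. For reference, the generated code for RangeStart looks roughly like this (RangeEnd is the same apart from the name and default value; treat the exact metadata fields as an approximation of what the Power Query UI generates rather than something you need to type in yourself):

// RangeStart
#datetime(2020, 4, 1, 0, 0, 0) meta [
  IsParameterQuery = true,
  Type = "DateTime",
  IsParameterQueryRequired = true
]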

Next you need a Power Query query to load the news stories. Here’s the M code for the query:

let
//Connect to the BBC News Top Stories RSS feed
//and create a nicely formatted table
Source = Xml.Tables(Web.Contents("http://feeds.bbci.co.uk/news/rss.xml")),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:version", Int64.Type}}),
channel = #"Changed Type"{0}[channel],
#"Changed Type1" = Table.TransformColumnTypes(channel,{{"title", type text}, {"description", type text}, {"link", type text}, {"generator", type text}, {"lastBuildDate", type datetime}, {"copyright", type text}, {"language", type text}, {"ttl", Int64.Type}}),
item = #"Changed Type1"{0}[item],
#"Changed Type2" = Table.TransformColumnTypes(item,{{"title", type text}, {"description", type text}, {"link", type text}, {"pubDate", type datetime}}),
#"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"guid"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"title", "Title"}, {"description", "Description"}, {"link", "Link"}, {"pubDate", "Publication Date"}}),
//Find the current date and time when this query runs
CurrentDateTime = DateTimeZone.FixedUtcNow(),
//Find yesterday's date
PreviousDay = Date.AddDays(DateTime.Date(CurrentDateTime),-1),
//Put the current date and time in a new column in the table
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "UTC Data Load Date", each CurrentDateTime),
#"Changed Type3" = Table.TransformColumnTypes(#"Added Custom",{{"UTC Data Load Date", type datetimezone}}),
//Add the filter required for incremental refresh
//Only return rows in this table if:
//a) The RangeStart parameter equals yesterday's date, and
//b) RangeEnd is not null (which should never be true)
#"Filtered Rows" = Table.SelectRows(#"Changed Type3", each DateTime.Date(RangeStart)=PreviousDay and RangeEnd<>null)
in
#"Filtered Rows"

As I said, the first couple of steps aren’t interesting – they just connect to the RSS feed to get the list of top stories as a table. After that:

  • The CurrentDateTime step gets the current date and time at the point when the dataset refreshes. It’s important to note that I’ve used the DateTimeZone.FixedUtcNow function – this not only gives you the current UTC date and time, but it’s guaranteed to give you the same date and time every time you call it within a query. If you use DateTimeZone.UtcNow you may get a different date time returned every time the function is called within the query, which can make things very confusing (see the small example after this list).
  • The PreviousDay step gets yesterday’s date by extracting the date from the value found in the CurrentDateTime step and subtracting one day.
  • The #”Added Custom” step adds a new column to the table called “UTC Data Load Date” containing the value returned by the CurrentDateTime step. As the name suggests, this column shows when the data in any particular row was loaded.
  • The #”Filtered Rows” step is extremely important. It filters the rows in the table: it will return all the rows in the table if the RangeStart parameter returns yesterday’s date, otherwise it will return no rows at all. It also has an AND condition that checks whether the RangeEnd parameter is not null, which should always be true. Without this step that uses RangeStart and RangeEnd in some way you won’t be able to turn on incremental refresh; the significance of only returning data for yesterday’s date will become clear later.
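
As a quick illustration of the point about DateTimeZone.FixedUtcNow above, the following query should always return true because both calls are guaranteed to give the same value within a single evaluation; replace the two calls with DateTimeZone.UtcNow and that guarantee goes away:

let
  // Both calls return the same fixed value for this evaluation of the query
  FirstCall = DateTimeZone.FixedUtcNow(),
  SecondCall = DateTimeZone.FixedUtcNow(),
  AreTheyEqual = (FirstCall = SecondCall)
in
  AreTheyEqual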

At this point, unless you happen to have set the value of the RangeStart parameter to yesterday’s date, the table returned by the query will be empty.

Once you have loaded this table to your Power BI dataset (I called it “News”) you need to configure incremental refresh on it. You can do this by right-clicking on the table in the Fields pane in the main Power BI window and selecting Incremental refresh from the menu. Here’s how you will need to configure the settings:


You can ignore the warning about query folding and the M query at the top. What you need to do here is:

  • Set the “Incremental refresh” slider to On
  • Under “Store rows in the last” choose Days from the dropdown and enter the number of days you want to store data for. You probably don’t need to store data indefinitely; if you did, your dataset might get very large. Data that is older than the number of days specified here will be deleted from the dataset.
  • Under “Refresh rows in the last” again select Days from the dropdown and enter 1.
  • Do not check the boxes for “Detect data changes” and “Only refresh complete day”.

With that done all you need to do is publish to the Power BI Service and set up scheduled refresh for once a day (and no more). When the report refreshes each day, all the stories listed in the RSS feed will be added to the existing stories in the dataset. To monitor this I created a measure called Story Count to count the number of rows in the News table, and then created a simple report that showed the total value of this measure and also showed it broken down by the UTC Data Load Date column. Here’s what it looked like yesterday (the 12th of April 2020), after it had already been refreshed for a few days:


and here’s what it looked like today (the 13th of April 2020), with 51 more stories having been loaded in this morning:


Job done!

The last thing to point out is that connecting to the XMLA endpoint for the workspace (or should I say connecting via the “Analysis Services protocol”…?) using SQL Server Management Studio, assuming the workspace is in Premium, makes it a lot easier to understand what’s happening behind the scenes here. If you right-click on your table in SQL Server Management Studio’s Object Explorer pane and select Partitions:


…you can see the names of the 51 partitions created to hold the data (remember, above we opted to store 50 days of historical data), the number of rows of data in each partition and the dates when these partitions were last refreshed (or “processed” in Analysis Services terminology):


Notice that the partition for today’s date, April 13th, is empty and that both the partition for today and the partition for yesterday (April 12th) have been refreshed today; similarly, the partition for April 11th was last refreshed yesterday on April 12th. The two most recent partitions are always refreshed and this is why the #”Filtered Rows” step in the M code above only returns rows when the RangeStart parameter holds yesterday’s date, to make sure that the stories for the current day are only stored once.

You can also script out the table to TMSL like so:


This allows you to see the definitions of each partition and, crucially, the start and end values that are passed to the RangeStart and RangeEnd parameters when they are refreshed:


You can download my example pbix file for this post here.