Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh

Power BI incremental refresh is a very powerful feature and, now that it’s available in Shared capacity (not just Premium), everyone can use it. It’s designed for scenarios where you have a data warehouse running on a relational database, but with a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post I’m going to show you how to use incremental refresh to solve another very common problem – namely, how to get Power BI to keep the data that’s already in your dataset and add new data to it.

I know what you’re thinking at this point: isn’t this what incremental refresh is meant to do anyway? Well, yes it is, but as I said it’s designed to work in scenarios where a relational data warehouse stores a copy of all the data that’s in your dataset, and in some cases you don’t have that luxury. For example, you may want to:

  • Connect to a data source that only gives you a set of new sales transactions each day, and add these sales transactions to the ones you have already stored in a Power BI dataset
  • Take a snapshot of a data source, like an Excel workbook, that is changing all the time and store each of these daily snapshots of the contents of the workbook in a Power BI dataset

The current Power BI incremental refresh functionality doesn’t make it easy to do either of these things, and that’s why I’ve written this post.

Let me be clear though: in all these cases I recommend that you don’t use the technique I’m going to show you. If possible, you should stage a copy of each day’s data in a relational database (ie build that data warehouse) or even as text files in a folder (Power Automate may be useful to do this) and use that staged copy as the data source for Power BI. This will allow you to do a full refresh of the data in your dataset at any point in the future if you need to, or create a completely new dataset, even though it means you have to do a lot of extra work. If it isn’t possible to do this, or you’re too lazy or you’re just curious to see how my technique works, read on.
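
As an aside, if you can take the staging route, the Power BI source query then becomes a straightforward combine-files-from-a-folder pattern. Here’s a very rough sketch – not part of this post’s solution – assuming the daily extracts are saved as csv files to a hypothetical folder called C:\StagedData (see Miguel’s post linked above for a proper treatment of this pattern):

let
//Read all the staged csv files from a (hypothetical) folder
Source = Folder.Files("C:\StagedData"),
//Parse each file and promote its headers
Parsed = Table.AddColumn(Source, "Data", each Table.PromoteHeaders(Csv.Document([Content]))),
//Combine all the daily extracts into a single table
Combined = Table.Combine(Parsed[Data])
in
Combined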

For the example I’m going to show in this post I’m going to use a web-based data source, an RSS feed from the BBC News website that returns a list of the current top stories on the site. You can find it here:

http://feeds.bbci.co.uk/news/rss.xml

RSS is based on XML and there’s no authentication needed to access this feed, so loading a table of these top stories into Power BI is very easy – so easy, I’m not going to bother explaining how to do it. Instead I’m going to show you how to use incremental refresh to store a copy of this list of top stories every day in the same table in a dataset.

First of all, for incremental refresh to work, two Power Query parameters of data type Date/Time, called RangeStart and RangeEnd, need to be created. More details about how to create them can be found in the docs here. Power BI expects them to be used to filter the rows loaded into a table; it also uses them to partition these tables in the dataset.
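
In case you haven’t created parameters like these before: a parameter in Power Query is just a query whose value has some extra metadata attached. As a rough illustration – the Manage Parameters dialog generates this M for you, and the initial values below are only placeholders, since the real values are passed in when incremental refresh runs – the two parameters look something like this behind the scenes:

//RangeStart (the initial value is just a placeholder)
#datetime(2020, 4, 1, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]

//RangeEnd (the initial value is just a placeholder)
#datetime(2020, 4, 30, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]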

Next you need a Power Query query to load the news stories. Here’s the M code for the query:

let
//Connect to the BBC News Top Stories RSS feed
//and create a nicely formatted table
Source = Xml.Tables(Web.Contents("http://feeds.bbci.co.uk/news/rss.xml")),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:version", Int64.Type}}),
channel = #"Changed Type"{0}[channel],
#"Changed Type1" = Table.TransformColumnTypes(channel,{{"title", type text}, {"description", type text}, {"link", type text}, {"generator", type text}, {"lastBuildDate", type datetime}, {"copyright", type text}, {"language", type text}, {"ttl", Int64.Type}}),
item = #"Changed Type1"{0}[item],
#"Changed Type2" = Table.TransformColumnTypes(item,{{"title", type text}, {"description", type text}, {"link", type text}, {"pubDate", type datetime}}),
#"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"guid"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"title", "Title"}, {"description", "Description"}, {"link", "Link"}, {"pubDate", "Publication Date"}}),
//Find the current date and time when this query runs
CurrentDateTime = DateTimeZone.FixedUtcNow(),
//Find yesterday's date
PreviousDay = Date.AddDays(DateTime.Date(CurrentDateTime),-1),
//Put the current date and time in a new column in the table
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "UTC Data Load Date", each CurrentDateTime),
#"Changed Type3" = Table.TransformColumnTypes(#"Added Custom",{{"UTC Data Load Date", type datetimezone}}),
//Add the filter required for incremental refresh
//Only return rows in this table if:
//a) The RangeStart parameter equals yesterday's date, and
//b) RangeEnd is not null (which should always be true)
#"Filtered Rows" = Table.SelectRows(#"Changed Type3", each DateTime.Date(RangeStart)=PreviousDay and RangeEnd<>null)
in
#"Filtered Rows"

As I said, the first couple of steps aren’t interesting – they just connect to the RSS feed to get the list of top stories as a table. After that:

  • The CurrentDateTime step gets the current date and time at the point when the dataset refreshes. It’s important to note that I’ve used the DateTimeZone.FixedUtcNow function – this not only gives you the current UTC date and time, but it’s guaranteed to give you the same date and time every time you call it within a query. If you use DateTimeZone.UtcNow you may get a different date and time returned every time the function is called within the query, which can make things very confusing (see the short example after this list).
  • The PreviousDay step gets yesterday’s date by extracting the date from the value found in the CurrentDateTime step and subtracting one day.
  • The #”Added Custom” step adds a new column to the table called “UTC Data Load Date” containing the value returned by the CurrentDateTime step. As the name suggests, this column shows when the data in any particular row was loaded.
  • The #”Filtered Rows” step is extremely important. It filters the rows in the table: it returns all the rows in the table if the RangeStart parameter holds yesterday’s date, otherwise it returns no rows at all. It also has an and condition that checks whether the RangeEnd parameter is not null, which should always be true. Without a step that uses RangeStart and RangeEnd in some way you won’t be able to turn on incremental refresh; the significance of only returning data for yesterday’s date will become clear later.
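
To illustrate the difference between the two functions, here’s a small standalone query – not part of the solution, just a sketch you can paste into a blank query in the Power Query Editor. Within a single evaluation the two calls to DateTimeZone.FixedUtcNow always return the same value, whereas the two calls to DateTimeZone.UtcNow may not:

let
//Both calls to FixedUtcNow return the same value within one evaluation of the query...
Fixed1 = DateTimeZone.FixedUtcNow(),
Fixed2 = DateTimeZone.FixedUtcNow(),
//...so this comparison always returns true
FixedValuesMatch = (Fixed1 = Fixed2),
//Two calls to UtcNow, on the other hand, may return slightly different values
NonFixed1 = DateTimeZone.UtcNow(),
NonFixed2 = DateTimeZone.UtcNow()
in
[FixedValuesMatch = FixedValuesMatch, FirstUtcNow = NonFixed1, SecondUtcNow = NonFixed2]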

At this point, unless you happen to have set the value of the RangeStart parameter to yesterday’s date, the table returned by the query will be empty.

Once you have loaded this table to your Power BI dataset (I called it “News”) you need to configure incremental refresh on it. You can do this by right-clicking on the table in the Fields pane in the main Power BI window and selecting Incremental refresh from the menu. Here’s how you will need to configure the settings:

[Image: the Incremental refresh configuration dialog]

You can ignore the warning about query folding and the M query at the top. What you need to do here is:

  • Set the “Incremental refresh” slider to On
  • Under “Store rows in the last” choose Days from the dropdown and enter the number of days you want to store data for (I chose 50 days for this example). You probably don’t need to store data indefinitely; if you did, your dataset might get very large. Data that is older than the number of days specified here will be deleted from the dataset.
  • Under “Refresh rows in the last” again select Days from the dropdown and enter 1.
  • Do not check the boxes for “Detect data changes” and “Only refresh complete day”.

With that done all you need to do is publish to the Power BI Service and set up scheduled refresh for once a day (and no more). When the report refreshes each day, all the stories listed in the RSS feed will be added to the existing stories in the dataset. To monitor this I created a measure called Story Count to count the number of rows in the News table, and then created a simple report that showed the total value of this measure and also showed it broken down by the UTC Data Load Date column. Here’s what it looked like yesterday (the 12th of April 2020), after it had already been refreshed for a few days:

[Image: the report as it looked on 12 April 2020]

and here’s what it looked like today (the 13th of April 2020), with 51 more stories having been loaded in this morning:

[Image: the report as it looked on 13 April 2020]

Job done!

The last thing to point out is that, if your workspace is in Premium, connecting to its XMLA endpoint (or should I say connecting via the “Analysis Services protocol”…?) using SQL Server Management Studio makes it a lot easier to understand what’s happening behind the scenes here. If you right-click on your table in SQL Server Management Studio’s Object Explorer pane and select Partitions:

[Image: the Partitions option in SQL Server Management Studio’s Object Explorer]

…you can see the names of the 51 partitions created to hold the data (remember, above we opted to store 50 days of historical data), the number of rows of data in each partition and the dates that these partitions were last refreshed (or “processed” in Analysis Services terminology):

[Image: the partition names, row counts and last processed dates]

Notice that the partition for today’s date, April 13th, is empty and that both the partitions for today and yesterday (April 12th) have been refreshed today; similarly, the partition for April 11th was last refreshed yesterday, on April 12th. The two most recent partitions are always refreshed, and this is why the #”Filtered Rows” step in the M code above only returns rows when the RangeStart parameter holds yesterday’s date – to make sure that the stories for the current day are only stored once.
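
To see how this plays out, here’s one more small standalone sketch (again, not part of the solution) that evaluates the #”Filtered Rows” condition for the two partitions refreshed on April 13th, assuming the partition boundaries fall at midnight:

let
//When the refresh runs on April 13th, the PreviousDay step evaluates to April 12th
PreviousDay = #date(2020, 4, 12),
//Approximate RangeStart values passed to the April 12th and April 13th partitions
RangeStartApril12 = #datetime(2020, 4, 12, 0, 0, 0),
RangeStartApril13 = #datetime(2020, 4, 13, 0, 0, 0),
//The filter condition is true for the April 12th partition, so today's stories are stored there...
April12PartitionGetsRows = (DateTime.Date(RangeStartApril12) = PreviousDay),
//...and false for the April 13th partition, so it stays empty until tomorrow's refresh
April13PartitionGetsRows = (DateTime.Date(RangeStartApril13) = PreviousDay)
in
[April12 = April12PartitionGetsRows, April13 = April13PartitionGetsRows]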

You can also script out the table to TMSL like so:

[Image: scripting the table out to TMSL in SQL Server Management Studio]

This allows you to see the definitions of each partition and, crucially, the start and end values that are passed to the RangeStart and RangeEnd parameters when they are refreshed:

[Image: the partition definitions, showing the values passed to RangeStart and RangeEnd]

You can download my example pbix file for this post here.

30 responses

  1. “or even as text files in a folder (Power Automate may be useful to do this) and use that staged copy as the data source for Power BI”

    or even better:
    as Excel files as the data source for Excel!
    US government users cannot get a Power BI account. Excel works fine for small to medium size datamarts/reportmarts!

  2. Pingback: Incremental Refresh with Power BI – Curated SQL

  3. Pingback: Dataflows as an Alternative to Incremental Loading in Power BI – Curated SQL

  4. How can Incremental Refresh work on a SharePoint “List”. For instance, when a new item (record, row) is added to a SharePoint List, there is an associated Modified Date. Can an incremental refresh be performed based on a SharePoint List “Item” (row) “Modified Date” ?

    • You would need to use something like the technique Miguel Escobar uses in the blog post I linked to above. Why do you need to do this? Is it really slow to load all the data from your SharePoint list?

  5. This is exactly the kind of pattern I was after for a once-a-day emailed spreadsheet that is unique each day but that I wanted to append to each day.

    However, I thought this could equally be applied to dataflows, but I got some really weird behaviour there. I tried different combinations: creating the parameters first before turning on incremental refresh (the query wouldn’t save), and turning incremental refresh on first, in which case a recurring list of parameters would be auto-created down the editor.

    Reverting to trying with dataset….

    • Interesting, I haven’t tried it with dataflows. Since datasets and dataflows are very different under the covers I can believe that a technique that works for one doesn’t work for the other, but I’ll try to investigate when I have a moment.

      • Hi Chris, Thanks for amazing blog!
        This is exactly what I am trying to achieve but with dataflows. I have three different tables to pull the data from. I followed all your steps but couldn’t get the expected result. So I was wondering if you did get a chance to investigate how the Incremental refresh works with dataflows.
        Thanks!

  6. Pingback: Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh | Pardaan.com

  7. I have tried the above solution but it’s still not adding new data; it only shows the new data after a refresh. Can anyone help me please?

  8. Hey buddy.
    How do we define the RangeStart and RangeEnd? You are saying yesterday’s date, but surely that is not constant from the date when we created this solution.
    Don’t we need to put in a dynamic value so that it becomes the previous day’s date whenever the refresh runs?

  9. Great content!

    I tried using this approach to obtain data from a web source, to no avail 🙁

    I am calling a web source (API) and need to have data for the last 12 months. Obviously incremental refresh would be a great option so that I don’t have to pull 12 months of data on every refresh.

    I created the RangeStart and RangeEnd parameters and included them in the web URL’s from and to criteria. I had to create a function to deal with pagination as well, so I loop through the URL for each page of data. I enabled incremental refresh to store data for 12 months and refresh the last 3 days. All looked well in the Desktop version after saving and applying (latest May 2020 release). I published it to the service, but when I hit refresh it failed with missing column errors. I suspected that for older dates some columns might not have data, so I opened the original desktop file again to troubleshoot. Before troubleshooting I changed the RangeStart date to a recent date, just so that I didn’t pull thousands of records into the desktop version, but I couldn’t apply changes – I received an ODBC error. The only way to fix it was to change the RangeStart and RangeEnd parameter type to Date from Date/Time, which broke the incremental refresh functionality and automatically disabled it, which in turn enabled me to apply the changes again. I am not sure if this is an issue with the latest May release or if there are underlying issues using this option with a web source. So I have two issues: one, the incremental refresh implemented with a web source URL is not working in the service; and two, I am unable to make changes to the desktop file once incremental refresh is turned on.

    • It’s hard to say what’s going on here, but I suspect you’ll have to find another way to fix the problem without changing the type of the RangeStart and RangeEnd parameters.

  10. Hi Chris,
    I followed your steps; the 1st day ran perfectly, but on the 2nd day I’m getting an empty table. Can you help me? Maybe I can send you a screenshot or my code?

    • about parameters:
      Current Value of RangeStart was set to 07/08/20 12:00:00 AM
      Current Value of RangeEnd was set to 12/31/20 12:00:00 AM

      There is no relationship between tables RangeStart – RangeEnd – backlog

      The first date was OK, but on the 2nd day the table was empty.

      see below M code:
      ————
      let
      Source = Excel.Workbook(File.Contents("C:\Amdocs\Dashboard\backlog.xlsx"), null, true),
      backlog_Sheet = Source{[Item="backlog",Kind="Sheet"]}[Data],
      #"Changed Type" = Table.TransformColumnTypes(backlog_Sheet,{{"Column1", type text}}),
      #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type", [PromoteAllScalars=true]),
      #"Changed Type1" = Table.TransformColumnTypes(#"Promoted Headers",{{"Inc Call ID", type text}}),
      //Find the current date and time when this query runs
      CurrentDateTime = DateTimeZone.FixedUtcNow(),
      //Find yesterday's date
      PreviousDay = Date.AddDays(DateTime.Date(CurrentDateTime),-1),
      //Put the current date and time in a new column in the table
      #"Added Custom" = Table.AddColumn(#"Changed Type1", "UTC Data Load Date", each CurrentDateTime),
      #"Changed Type2" = Table.TransformColumnTypes(#"Added Custom",{{"UTC Data Load Date", type datetimezone}}),
      //Add the filter required for incremental refresh
      //Only return rows in this table if:
      //a) The RangeStart parameter equals yesterday's date, and
      //b) RangeEnd is not null (which should always be true)
      #"Filtered Rows" = Table.SelectRows(#"Changed Type2", each DateTime.Date(RangeStart)=PreviousDay and RangeEnd<>null)
      in
      #"Filtered Rows"

      ————–

    • Hi Luis and Chris,

      Same issue on my end! Refreshing the next day just blanks the table. I’ll go double check my steps, but hopefully a Power BI update didn’t break this.

  11. Hi Chris,

    I’m having a problem when trying to use an incremental refresh.

    I’m on Premium, I have a file size of 3GB.

    When I’m testing it seems really inconsistent. I’ve set up my RangeStart and RangeEnd on a query that folds, and added the incremental refresh policy to the table.

    Now if I save this file and reopen it, it is almost a lottery as to whether it will recognise that the incremental refresh policy is still there. Sometimes it will not find any parameters (there are parameters, as that is how I set up the incremental refresh in the first place). Sometimes it can’t figure out that it’s query folding.

    I have most recently refreshed it all, made sure it kept my incremental settings and published it. I have tried to refresh it 4 times since. Once it refreshed quickly and seemed to work OK (50 mins to refresh in total). Trying to refresh the same file again now, it’s taking over 3 hours. When I look at the SQL sessions that are running, it looks like it finishes refreshing the whole model, 5 minutes pass, then it restarts running all of the SQL again, trying to refresh all the tables again. In total I think it tries to run all the SQL code 3 times before it ends up failing.

    Does the size of the pbix file have any impact on Incremental Refresh?

    I’ve looked at the XMLA partitions and can see they are in place, so I’m really struggling to figure out a way around this. Because of the size of the file, it takes a long time to try and diagnose any issues, so I was hoping you might have some insight as to whether you have seen or heard of this issue before?

  12. Hi Chris, that was great!

    I’m trying to build a daily tracker for project task progress over time. The issue I have with your method is that I can only see the updated tasks and miss the full picture of the entire project. As an example, let’s say I have 100 tasks with today’s status (60 on hold, 20 started, 20 completed) and want to track the progress day by day as follows:

    – day 2: (55 hold, 22 started, 23 completed)
    – day 3: (50 hold, 25 started, 25 completed)
    – day 4: (50 hold, 20 started, 30 completed)
    etc..
    – day x: (0 hold, 0 started, 100 completed)

    Is there any simple way to do it, since my application is very small and a data warehouse is not feasible?

  13. Cracking piece Chris. A simple read-through of the RangeStart and RangeEnd piece, a copy and paste of the code from your GitHub window, and I have a self-refreshing daily update coming from Salesforce giving me daily data going back 2 months so I can track trends, without having to invest in any costly add-ons to Salesforce.
    You have made a major contribution to Safety within my industry. Thank you.
