Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh

[Update September 2023: now that Power BI is part of Microsoft Fabric, the new features of Fabric make it much easier to solve this problem as described here]

Power BI incremental refresh is a very powerful feature and now that it’s available in Shared capacity (not just Premium), everyone can use it. It’s designed for scenarios where you have a data warehouse running on a relational database, but with a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post I’m going to show you how to use incremental refresh to solve another very common problem – namely how to get Power BI to keep the data that’s already in your dataset and add new data to it.

I know what you’re thinking at this point: isn’t this what incremental refresh is meant to do anyway? Well, yes it is, but as I said it’s designed to work in scenarios where a relational data warehouse stores a copy of all the data that’s in your dataset, and in some cases you don’t have that luxury. For example, you may want to:

  • Connect to a data source that only gives you a set of new sales transactions each day, and add these sales transactions to the ones you have already stored in a Power BI dataset
  • Take a snapshot of a data source, like an Excel workbook, that is changing all the time and store each of these daily snapshots of the contents of the workbook in a Power BI dataset

The current Power BI incremental refresh functionality doesn’t make it easy to do either of these things, and that’s why I’ve written this post.

Let me be clear though: in all these cases I recommend that you don’t use the technique I’m going to show you. If possible, you should stage a copy of each day’s data in a relational database (ie build that data warehouse) or even as text files in a folder (Power Automate may be useful to do this) and use that staged copy as the data source for Power BI. This will allow you to do a full refresh of the data in your dataset at any point in the future if you need to, or create a completely new dataset, even though it means you have to do a lot of extra work. If it isn’t possible to do this, or you’re too lazy or you’re just curious to see how my technique works, read on.

For the example I’m going to show in this post I’m going to use a web-based data source: an RSS feed from the BBC News website that returns a list of the current top stories on the site. You can find it here:

http://feeds.bbci.co.uk/news/rss.xml

RSS is based on XML and there’s no authentication needed to access this feed, so loading a table of these top stories into Power BI is very easy – so easy, I’m not going to bother explaining how to do it. Instead I’m going to show you how to use incremental refresh to store a copy of this list of top stories every day in the same table in a dataset.

First of all, for incremental refresh to work, two Power Query parameters of data type Date/Time called RangeStart and RangeEnd need to be created. More details about how to create them can be found in the docs here. Power BI expects them to be used to filter the rows loaded into a table; it also uses them to partition these tables in the dataset.
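
If you create these two parameters using the Manage Parameters dialog in Power BI Desktop, the M generated behind the scenes looks something like the following sketch (the default values shown here are arbitrary placeholders, not values you need to use):

//RangeStart parameter
#datetime(2020, 4, 1, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]

//RangeEnd parameter
#datetime(2020, 4, 2, 0, 0, 0) meta [IsParameterQuery = true, Type = "DateTime", IsParameterQueryRequired = true]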

Next you need a Power Query query to load the news stories. Here’s the M code for the query:


let
//Connect to the BBC News Top Stories RSS feed
//and create a nicely formatted table
Source = Xml.Tables(Web.Contents("http://feeds.bbci.co.uk/news/rss.xml")),
#"Changed Type" = Table.TransformColumnTypes(Source,{{"Attribute:version", Int64.Type}}),
channel = #"Changed Type"{0}[channel],
#"Changed Type1" = Table.TransformColumnTypes(channel,{{"title", type text}, {"description", type text}, {"link", type text}, {"generator", type text}, {"lastBuildDate", type datetime}, {"copyright", type text}, {"language", type text}, {"ttl", Int64.Type}}),
item = #"Changed Type1"{0}[item],
#"Changed Type2" = Table.TransformColumnTypes(item,{{"title", type text}, {"description", type text}, {"link", type text}, {"pubDate", type datetime}}),
#"Removed Columns" = Table.RemoveColumns(#"Changed Type2",{"guid"}),
#"Renamed Columns" = Table.RenameColumns(#"Removed Columns",{{"title", "Title"}, {"description", "Description"}, {"link", "Link"}, {"pubDate", "Publication Date"}}),
//Find the current date and time when this query runs
CurrentDateTime = DateTimeZone.FixedUtcNow(),
//Find yesterday's date
PreviousDay = Date.AddDays(DateTime.Date(CurrentDateTime),-1),
//Put the current date and time in a new column in the table
#"Added Custom" = Table.AddColumn(#"Renamed Columns", "UTC Data Load Date", each CurrentDateTime),
#"Changed Type3" = Table.TransformColumnTypes(#"Added Custom",{{"UTC Data Load Date", type datetimezone}}),
//Add the filter required for incremental refresh
//Only return rows in this table if:
//a) The RangeStart parameter equals yesterday's date, and
//b) RangeEnd is not null (which should always be true)
#"Filtered Rows" = Table.SelectRows(#"Changed Type3", each DateTime.Date(RangeStart)=PreviousDay and RangeEnd<>null)
in
#"Filtered Rows"

As I said, the first couple of steps aren’t interesting – they just connect to the RSS feed to get the list of top stories as a table. After that:

  • The CurrentDateTime step gets the current date and time at the point when the dataset refreshes. It’s important to note that I’ve used the DateTimeZone.FixedUtcNow function – this not only gives you the current UTC date and time, but it’s guaranteed to give you the same date and time every time you call it within a query. If you use DateTimeZone.UtcNow you may get a different date time returned every time the function is called within the query, which can make things very confusing (see the short example after this list).
  • The PreviousDay step gets yesterday’s date by extracting the date from the value found in the CurrentDateTime step and subtracting one day.
  • The #”Added Custom” step adds a new column to the table called “UTC Data Load Date” containing the value returned by the CurrentDateTime step. As the name suggests, this column shows when the data in any particular row was loaded.
  • The #”Filtered Rows” step is extremely important. It filters the rows in the table: it will return all the rows in the table if the RangeStart parameter returns yesterday’s date, otherwise it will return no rows at all. It also has an AND condition that checks whether the RangeEnd parameter is not null, which should always be true. Without this step that uses RangeStart and RangeEnd in some way you won’t be able to turn on incremental refresh; the significance of only returning data for yesterday’s date will become clear later.
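
To illustrate the point about DateTimeZone.FixedUtcNow, here’s a minimal standalone query – just an illustrative sketch, not part of the News query – that compares the two functions:

let
//DateTimeZone.UtcNow may return a different value each time it is called within a query
Volatile1 = DateTimeZone.UtcNow(),
Volatile2 = DateTimeZone.UtcNow(),
//DateTimeZone.FixedUtcNow is guaranteed to return the same value for every call within the query
Fixed1 = DateTimeZone.FixedUtcNow(),
Fixed2 = DateTimeZone.FixedUtcNow(),
Output = [UtcNowValuesCanDiffer = (Volatile1 <> Volatile2), FixedUtcNowValuesMatch = (Fixed1 = Fixed2)]
in
Output

In this sketch FixedUtcNowValuesMatch will always be true, whereas UtcNowValuesCanDiffer may or may not be, depending on exactly when the two calls happen to be evaluated.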

At this point, unless you happen to have set the value of the RangeStart parameter to yesterday’s date, the table returned by the query will be empty.

Once you have loaded this table to your Power BI dataset (I called it “News”) you need to configure incremental refresh on it. You can do this by right-clicking on the table in the Fields pane in the main Power BI window and selecting Incremental refresh from the menu. Here’s how you will need to configure the settings:

[Screenshot: the incremental refresh configuration dialog]

You can ignore the warning about query folding and the M query at the top. What you need to do here is:

  • Set the “Incremental refresh” slider to On
  • Under “Store rows in the last” choose Days from the dropdown and enter the number of days you want to store data for (I used 50 days for this example). You probably don’t need to store data indefinitely; if you did, your dataset might get very large. Data that is older than the number of days specified here will be deleted from the dataset.
  • Under “Refresh rows in the last” again select Days from the dropdown and enter 1.
  • Do not check the boxes for “Detect data changes” and “Only refresh complete day”.

With that done, all you need to do is publish to the Power BI Service and set up scheduled refresh for once a day (and no more). When the report refreshes each day, all the stories listed in the RSS feed will be added to the existing stories in the dataset. To monitor this I created a measure called Story Count to count the number of rows in the News table (a sketch of this measure is shown below, after the screenshots), and then created a simple report that showed the total value of this measure and also showed it broken down by the UTC Data Load Date column. Here’s what it looked like yesterday (the 12th of April 2020), after it had already been refreshed for a few days:

[Screenshot: report showing Story Count broken down by UTC Data Load Date, as of April 12th 2020]

and here’s what it looked like today (the 13th of April 2020), with 51 more stories having been loaded in this morning:

[Screenshot: the same report on April 13th 2020, including the newly loaded stories]

Job done!
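
For reference, the Story Count measure used above is nothing more than a row count over the News table; a minimal DAX sketch (assuming your table is called News, as in this post) would be something like:

Story Count = COUNTROWS('News')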

The last thing to point out is that, if the workspace is in Premium, connecting to its XMLA endpoint (or should I say connecting via the “Analysis Services protocol”…?) using SQL Server Management Studio makes it a lot easier to understand what’s happening behind the scenes here. If you right-click on your table in SQL Server Management Studio’s Object Explorer pane and select Partitions:

[Screenshot: the Partitions option in SQL Server Management Studio’s Object Explorer]

…you can see the names of the 51 partitions created to hold the data (remember, above we opted to store 50 days of historical data), the number of rows of data in each partition and the dates that these partitions were last refreshed (or “processed” in Analysis Services terminology):

PartitionRefresh

Notice that the partition for today’s date, April 13th, is empty and that both the partition for today and the partition for yesterday (April 12th) have been refreshed today; similarly, the partition for April 11th was last refreshed yesterday on April 12th. The two most recent partitions are always refreshed, and this is why the #”Filtered Rows” step in the M code above only returns rows when the RangeStart parameter holds yesterday’s date – to make sure that the stories for the current day are only stored once.

You can also script out the table to TMSL like so:

[Screenshot: scripting out the table to TMSL in SQL Server Management Studio]

This allows you to see the definitions of each partition and, crucially, the start and end values that are passed to the RangeStart and RangeEnd parameters when they are refreshed:

[Screenshot: the TMSL definition of a partition, showing the values passed to RangeStart and RangeEnd]

You can download my example pbix file for this post here.

56 thoughts on “Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh”

  1. “or even as text files in a folder (Power Automate may be useful to do this) and use that staged copy as the data source for Power BI”

    or even better:
    as Excel files as the data source for Excel!
    US government users cannot get a Power BI account. Excel works fine for small to medium size datamarts/reportmarts!

      1. I don’t see the video talking about updating a report, like creating a table. From what I understood ALM toolkit allows you to do metadata updates, but not report level changes like visualizations.

        Any idea on how I can update a report. Thanks!

    1. Hi Jay, you possibly can’t see the code because it’s being hosted on Github, so if your company blocks Github (mine does) then you won’t see it…

  2. How can Incremental Refresh work on a SharePoint “List”. For instance, when a new item (record, row) is added to a SharePoint List, there is an associated Modified Date. Can an incremental refresh be performed based on a SharePoint List “Item” (row) “Modified Date” ?

    1. You would need to use something like the technique Miguel Escobar uses in the blog post I linked to above. Why do you need to do this? Is it really slow to load all the data from your SharePoint list?

      1. Hi I am having this same issue. Yes, it takes approximately 30 minutes to refresh the dataset and I’d like to know if there’s a way to incrementally refresh on modified date? Thank you!

      2. I have found file operations (read, write) in Sharepoint to be much slower than my alternatives (intranet server or Box). Evidently Sharepoint is not designed to be fast for data storage and retrieval??

      3. Correct- in fact SharePoint can sometimes throttle requests for data. It’s convenient but there are better options for storing files used by Power BI.

      4. So how is it ok that MS 365 seems to recommend using Sharepoint? Why use something new – Sharepoint – when what we were using before is faster and better (Box also saves the last 100 versions of each file!). Anyone recommending Sharepoint is doing many of us a disservice at best, right?

      5. It depends on what you want to use SharePoint for. Did someone specifically recommend it as a place to store data for Power BI?

  3. This is exactly the kind of pattern I was after for a spreadsheet that is emailed once a day, is unique each day, and that I wanted to append to each day.

    However I thought this could equally be applied to dataflows, but I got some really weird behaviour in dataflows. I tried different combinations: creating the parameters first before turning on incremental refresh (the query wouldn’t save), and turning incremental refresh on first, in which case a recurring list of parameters would be auto-created down the editor.

    Reverting to trying with dataset….

    1. Interesting, I haven’t tried it with dataflows. Since datasets and dataflows are very different under the covers I can believe that a technique that works for one doesn’t work for the other, but I’ll try to investigate when I have a moment.

      1. Hi Chris, thanks for the amazing blog!
        This is exactly what I am trying to achieve but with dataflows. I have three different tables to pull the data from. I followed all your steps but couldn’t get the expected result. So I was wondering if you did get a chance to investigate how the Incremental refresh works with dataflows.
        Thanks!

  4. I have tried the above solution but it’s still not adding new data – it only shows the new data after each refresh. Can anyone help me please?

  5. Hey buddy.
    How do we define the RangeStart and RangeEnd? You are saying yesterday’s date, but surely it is not a constant value from when we created this solution.
    Don’t we need to put in a dynamic value so that it becomes the previous day’s date whenever the refresh runs?

  6. Great content!

    I tried using this approach obtaining data from a web source, to no avail 🙁

    I am calling a web source (an API) and need to have data for the last 12 months. Obviously incremental refresh would be a great option so that I don’t have to pull 12 months of data on every refresh.

    I created the RangeStart and RangeEnd parameters and included them in the web URL’s from and to criteria. I had to create a function to deal with pagination as well, so I loop the URL for each page of data. I enabled incremental refresh to store data for 12 months and refresh the last 3 days. All looked well in the Desktop version after saving and applying (latest May 2020 release). I published it to the service, but when I hit refresh it failed with missing column errors. I suspected that for older dates some columns might not have data, so I opened the original desktop file again to troubleshoot. Beforehand I changed the RangeStart date to a recent date, just so that I didn’t pull thousands of records into the desktop version, but I couldn’t apply the changes – I received an ODBC error. The only way to fix it was to change the RangeStart and RangeEnd parameter type from Date/Time to Date, which broke the incremental refresh functionality and automatically disabled it, which in turn enabled me to apply the changes again. I am not sure if this is an issue with the latest May release or if there are underlying issues using this option for a web source. So I have two issues: one, the incremental refresh implemented with a web source URL is not working in the service; and two, I am unable to make changes to the desktop file once incremental updates are turned on.

    1. It’s hard to say what’s going on here, but I suspect you’ll have to find another way to fix the problem without changing the type of the RangeStart and RangeEnd parameters.

  7. Hi Chris,
    I followed your steps; the 1st day ran perfectly, but on the 2nd day I’m getting an empty table. Can you help me? Maybe I can send you a screenshot or my code??

    1. about parameters:
      Current Value of RangeStart was set to 07/08/20 12:00:00 AM
      Current Value of RangeEnd was set to 12/31/20 12:00:00 AM

      There is no relationship between tables RangeStart – RangeEnd – backlog

      The first date was OK, but on the 2nd day the table was empty.

      see below M code:
      ————
      let
      Source = Excel.Workbook(File.Contents("C:\Amdocs\Dashboard\backlog.xlsx"), null, true),
      backlog_Sheet = Source{[Item="backlog",Kind="Sheet"]}[Data],
      #"Changed Type" = Table.TransformColumnTypes(backlog_Sheet,{{"Column1", type text}}),
      #"Promoted Headers" = Table.PromoteHeaders(#"Changed Type", [PromoteAllScalars=true]),
      #"Changed Type1" = Table.TransformColumnTypes(#"Promoted Headers",{{"Inc Call ID", type text}}),
      //Find the current date and time when this query runs
      CurrentDateTime = DateTimeZone.FixedUtcNow(),
      //Find yesterday's date
      PreviousDay = Date.AddDays(DateTime.Date(CurrentDateTime),-1),
      //Put the current date and time in a new column in the table
      #"Added Custom" = Table.AddColumn(#"Changed Type1", "UTC Data Load Date", each CurrentDateTime),
      #"Changed Type2" = Table.TransformColumnTypes(#"Added Custom",{{"UTC Data Load Date", type datetimezone}}),
      //Add the filter required for incremental refresh
      //Only return rows in this table if:
      //a) The RangeStart parameter equals yesterday's date, and
      //b) RangeEnd is not null (which should never be true)
      #"Filtered Rows" = Table.SelectRows(#"Changed Type2", each DateTime.Date(RangeStart)=PreviousDay and RangeEnd<>null)
      in
      #"Filtered Rows"

      ————–

    2. Hi Luis and Chris,

      Same issue on my end! Refreshing the next day just blanks the table. I’ll go double check my steps, but hopefully a Power BI update didn’t break this.

    3. Hi
      I have the same problem.
      First refresh great – 76 lines
      Next day – nothing – blank

  8. Hi Chris,

    I’m having a problem when trying to use an incremental refresh.

    I’m on Premium, I have a file size of 3GB.

    When I’m testing it seems really inconsistent. I’ve set up my RangeStart and End on a query that folds, adding the incremental refresh policy to the table.

    Now if I save this file and reopen it, it is almost a lottery as to whether it will recognise that the incremental refresh policy is still there. Sometimes it will not find any parameters? (There are parameters, as that is how I set up the incremental refresh in the first place.) Sometimes it can’t figure out that it’s query folding?

    I have most recently refreshed it all, made sure it kept my incremental settings and published it. I have tried to refresh it 4 times since. Once it refreshed quickly and seemed to work OK (50 mins to refresh in total). Trying to refresh the same file again now, it’s taking over 3 hours to refresh. When I look at the SQL sessions that are running, it looks like it finishes refreshing the whole model, 5 minutes will pass, then it will restart running all of the SQL again, trying to refresh all tables again. In total I think it tries to run all the SQL code 3 times before it ends up failing.

    Does the size of the pbix file have any impact on Incremental Refresh?

    I’ve looked at the XMLA partitions and can see they are in place, so I’m really struggling to figure out a way around this. Because of the size of the file, it takes a long time to try and diagnose any issues, so I was hoping you may have some insight as to whether you have seen or heard of this issue before?

  9. Hi Chris, that was great!

    I’m trying to build a daily tracker for project task progress over time. The issue I have with your method is that I can only see the updated tasks and miss the full picture of the entire project. As an example, let’s say I have 100 tasks with today’s status (60 on hold, 20 started, 20 completed) and I want to track the progress day by day as follows:

    – day 2: (55 hold, 22 started, 23 completed)
    – day 3: (50 hold, 25 started, 25 completed)
    – day 4: (50 hold, 20 started, 30 completed)
    etc..
    – day x: (0 hold, 0 started, 100 completed)

    Is there any simple way to do it, since my application is very small and not feasible for a data warehouse?

  10. Cracking piece Chris. A simple read-through of the RangeStart and RangeEnd piece and a copy and paste of the code from your GitHub window, and I have a self-refreshing daily update coming from SalesForce giving me daily data going back 2 months so I can track trends, without having to invest in any costly add-ons to SalesForce.
    You have made a major contribution to Safety within my industry. Thank you.

  11. Hi Chris,

    Excellent post, thanks! I’ve successfully used it to scrape mortgage interest rates from a website on a daily basis (by converting the underlying html to xml, I was able to set up autorefresh for a webpage without the need for a Gateway).

    However, now I want to build upon this data to compare the data from yesterday with today to assess if there are any changes in the interest rate.

    Is there a proper way to do this without breaking the incremental refresh functionality?

    Thanks,

    Bas

  12. Hi Chris,

    As I ran in some refresh issues with the version I made, I downloaded your BBCNews PBIX and published it into the service with a daily scheduled refresh (without specifying an update time).

    It looked promising, as the PBIX file in the desktop was empty, but after publishing it, it did contain data from that day (21-10-2020 11.37.57, which is the same time as the publication time). Today I would’ve expected similar lines, but then also for the 22-10 and possibly for the 23-10. However, it still shows only the data from 21-10. When doing a manual refresh in the service, it still only shows the data from 21-10. When opening the file in the desktop, I do get the current data when hitting refresh (without applying the last filter step in the query).

    Do you know what’s going wrong? I don’t understand why an exact copy of your PBIX doesn’t incrementally refresh when published straight into the service.

    Thanks in advance.

    Rick

  13. Hi Chris,

    I have followed your blog and the concept is working fine in a Power BI dataset. But I have a requirement to implement the same in Power BI dataflows, and when I tried it did not work.

    Could you also let us know how we can implement the same concept in Power BI Dataflows?

    Thanks in Advance,
    Shiva

  14. Hi Chris,

    Brilliant write-up. This is exactly what I tried to do. Like Rick, I downloaded your PBIX because I could not get the instructions to work on my own dataset. I tried publishing it as is and got the same result as Rick. I tried various other things but cannot get it to store past data and add to the dataset on refresh.

    Could it be a tenant setting that prevents you or anything like that?

  15. Chris, this is a nice article, but I’m curious… what about adding data to a dataset for something that is “snapshot-like” but has nothing to do with date and time? i.e. what if you had a parameter table somewhere that you want to populate, that triggers a dataflow based on those parameters and then just appends the data for the given Parm ID to the dataset? Dataflows themselves are technically always snapshots, but there doesn’t seem to be a good way to keep data from all of your snapshots so that you can go back and look at them easily over time. Any thoughts?

  16. Hi Chris, I downloaded your pbix file and also tried it on my own dataset, but the table seems to be blank with no data in it at all?

    What could be done?

  17. Great post, many thanks. I implemented the above with no issues. I have noticed that if manually refreshing from Power BI Desktop, it wipes out the history of all incremental refreshes. My question – is it possible to keep the historic incremental refresh with all partitions and re-upload in case you need to apply changes to the original pbix file? Many thanks!

  18. Hello Chris, this was a really good post. I have one question: would it work to refresh the last 24 hours every 5 minutes? I have not been able to do it.

  19. We are now in the midst of going live with multiple reports for multiple teams and we are wondering what’s the best way to implement the logic of incremental refresh within dataflows when combining historical data (via a SharePoint file) and daily load data via API calls.

    Here are some of the key configurations from our side:
    a. We are currently on a Power BI Premium subscription
    b. We would need to use dataflows (all transformation logic is stored in dataflows across multiple entities) to set up incremental refresh, as this dataflow would be used by multiple reports within individual workspaces catered to end users
    About the data –
    a. Historical data: historical (one-off activity) data is delivered by our DWH vendor through SFTP, which we are storing in SharePoint Online and pointing our Power BI to this location
    b. Our daily data is to be loaded using API calls from our DWH vendor’s location
    c. A key thing to note is that daily data is only available on weekdays (Mon – Fri), so basically we run the refresh on Monday to get last Friday’s close data
    d. We created a dataflow with two entities (historical and daily load)

    With all the above information, I have tried the two options below and noticed a few issues with each. I’m reaching out to you to hear of any better options to set this up.

    OPTION 1 – Set daily refresh for the last 4 or 5 days and append to the historical table, so the last 5 days’ data is kept updated in case of any backdated transactions. Append the daily load to the historical load. The 1st run works fine, then from the Day 2 load onwards Day 1’s data goes missing and Day 4 gets added.

    OPTION 2 – Refresh the daily load and append to the historical table. This option works well, but in this case I would need to pass a date parameter to pick data from the DWH for the last Friday’s close; and again, if I need to refresh twice a day (AM/PM) then this might not work and could cause duplicate data when appending to the history table.
    Also to be noted is that we are using an anonymous connection to connect to the Web API.

    Any guidance on this would be appreciated

  20. Hi,

    I have a few queries about incremental load in Power BI.

    Issue 1
    So basically we choose 2 date columns: one is the business date on which the data is partitioned, the other is an incremental date column which gets updated when the actual record is updated.
    So if I set the incremental option as ‘Refresh data in the last’ 1 month and ‘Store data in the last’ 24 months – basically 24 months of data based on the business date is stored, and only the last 1 month’s partition is checked for data updates.

    Is there some way to handle this so that I keep track of changes for older data as well?

    Issue 2
    If data is changed whose business date is 4 months back, it is not refreshed.
    Secondly, even if one record in my last 1 month changes, it reloads the entire month’s partition (which can have 100s or 1000s of records).

    Is there a solution to avoid this ?

    Issue 3
    Another scenario is: assume I store history in the table with a start date and end date, but the report needs only the latest active record. In that case, the incremental date is always new, as new history will be created on any change. So with this concept of incremental refresh, it may never work correctly.

    Is there a solution for this to be handled?

  21. Many thanks, Chris! The tutorial about “Incremental Refresh” was very helpful in solving my issue. Awesome \m/

  22. The options for incremental refresh have changed.
    1. Archive data starting # days before refresh date
    2. incrementally refresh data starting # days before refresh date

    Can you advise what settings should be applied, thanks.

  23. Hi Chris, many thanks for this, can I ask what would be the method for passing values to the RangeStart and RangeEnd variables on a day to day basis?

  24. Hi Chris – it appears that the September update at the beginning of your post has a broken link.
