If you’ve heard about the new Direct Lake mode for Power BI datasets in Fabric you’ll know that it gives you the query performance of Import mode (well, almost) without the need to actually import any data. Direct Lake datasets can be refreshed though – in fact, they refresh automatically by default – and if you look at the dataset’s Refresh History you’ll see there’s a Direct Lake section which sometimes shows errors:
Also, if you look at a custom dataset’s Settings page (although not yet for a default dataset) you’ll see some properties to control refresh too:
So what does it actually mean to refresh a Direct Lake dataset if it doesn’t involve loading data into the dataset?
The section on Direct Lake refresh in the Fabric docs has the following information:
Invoking a refresh for a Direct Lake dataset is a low cost operation where the dataset analyzes the metadata of the latest version of the Delta Lake table and is updated to reference the latest files in the OneLake.
Let’s see what this means using a simple example. I built a Dataflow Gen2 that loads a single row of data into a table in a Fabric Lakehouse with two columns: one called Sales that always contains the value 1 and one called LoadDate that contains the date and time the dataflow ran:
I ran the dataflow once to load a row of data into a table called MyTable in a Lakehouse:
I then built a custom dataset (because I want to change those refresh options mentioned above) consisting of just this table, and finally a report showing the contents of the table in the dataset:
I then connected SQL Server Profiler to the custom dataset via the XMLA Endpoint and started a trace to capture the Command Begin/End and Progress Report Begin/End events, and refreshed the dataflow (and only the dataflow) to load another row of data into the Lakehouse. Soon after the dataflow refresh finished, the Profiler trace showed a dataset refresh started automatically:
Refreshing the report showed the second row that had just been loaded:
This shows that, with the default settings, a Direct Lake dataset is automatically refreshed when data is loaded into a Lakehouse.
I then went to the Settings pane for the dataset and turned off the “Keep your Direct Lake data up to date” property:
I then ran the dataflow again and this time the Profiler trace showed that no automatic refresh took place; the new row was not shown in the report either. Manually refreshing the dataset from the workspace did result in the new row appearing in the report:
Next, I used a Notebook to delete all the rows from the table:
At this point the report still showed the three rows displayed in the previous screenshot. Finally, I refreshed the dataset one more time and all the data disappeared from the report:
It’s important to stress that the reason that the refresh is needed to show the latest data in the table is not because the data is being loaded into the dataset. It’s because, as the docs say, refresh tells the dataset to look at the latest version of the data in the table – which leads on to the whole topic of time travel in Delta tables in Fabric. Dennes Torres has a nice blog post on this subject here which is a great place to start.
Why would you ever want to refresh a Direct Lake dataset manually? Again, the docs have the answer:
You may want to disable [refresh] if, for example, you need to allow completion of data preparation jobs before exposing any new data to consumers of the dataset.
So, let’s say you need to load some new data to your table and also delete or update some data that’s already there and this needs to be done as several separate jobs. It’s very likely that you don’t want your Power BI reports to show any of the new data until all of these jobs have been completed, so to do this you will want to disable automatic refresh and do a manual dataset refresh as the last step of your ETL.