For any Power BI person, Direct Lake mode is the killer feature of Fabric. Import mode report performance (or near enough) directly on data from the lake, with none of the waiting around for data to refresh! It seems too good to be true. How can it be possible?
The full answer, going into detail about how data from Delta tables is transcoded to Power BI’s in-memory format, is too long for one blog post. But it is possible in part because of something that existed before Fabric but never got much attention: on-demand loading. There’s a blog post from December 2021 announcing it for Power BI Premium here:
TL;DR: instead of loading all the data from the tables in your Lakehouse into memory when a query is run, on-demand loading means only the data that is needed for a query is loaded, which naturally makes everything a lot faster. What’s more, you can see what gets loaded into memory using DMVs. Here’s a simple example…
Let’s say you have a Lakehouse with a table in it called Sales that contains three columns: Country, Product and Sales.

I populated this table using a Dataflow Gen2 with the following M expression:
let
    Source = #table(
        type table [Country = text, Product = text, Sales = number],
        {
            {"UK", "Apples", 1}, {"France", "Apples", 2},
            {"UK", "Oranges", 5}, {"Germany", "Pears", 10}
        }
    )
in
    Source
Let’s also say you have a custom dataset built on this Lakehouse containing this table, called Sales Custom Dataset (you can’t use the default dataset for what comes next, at least not yet).

At this point you can open up DAX Studio, connect to your workspace’s XMLA Endpoint, and query this dataset. The DMV mentioned in the blog post above, DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS, tells you (amongst other things) whether a column segment in a table can be paged in and out of memory, whether it is currently paged in or out, its “temperature” (which represents how frequently the column segment is accessed) and when the column segment was last accessed. A column segment is a structure that holds compressed data for a column in the Vertipaq engine inside Power BI.
Running the following query:
SELECT
    COLUMN_ID, SEGMENT_NUMBER, ISPAGEABLE,
    ISRESIDENT, TEMPERATURE, LAST_ACCESSED
FROM SYSTEMRESTRICTSCHEMA (
    $System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS,
    [DATABASE_NAME] = 'Sales Custom Dataset'
)
…immediately after creating the dataset or refreshing it (again, what refresh means for a Direct Lake dataset is a topic for another post, but it has the side effect of paging everything out of memory) gives the following result:

Note that for this simple dataset there is only one row returned per column; this will not be the case for datasets with more data (or depending on how the data is loaded into the lake), where there will be more rows because there are more segments. Each row represents a column segment, not a column. Also note that apart from the RowNumber column (which is hidden and always present) every column segment is pageable, is not resident in memory, and has no value for Temperature or Last Accessed.
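As an aside, on larger models it can be hard to tell which table a given segment belongs to, and the same DMV also returns the table name and the number of rows in each segment. Here’s a slightly extended version of the query above that brings those columns back too (DIMENSION_NAME holds the table name and RECORDS_COUNT the row count per segment); I haven’t shown its output here:

SELECT
    DIMENSION_NAME, COLUMN_ID, SEGMENT_NUMBER, RECORDS_COUNT,
    ISPAGEABLE, ISRESIDENT, TEMPERATURE, LAST_ACCESSED
FROM SYSTEMRESTRICTSCHEMA (
    $System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS,
    [DATABASE_NAME] = 'Sales Custom Dataset'
)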
Now let’s say that you build a report on this custom dataset with a single table visual using just the Country and Sales columns, like so:

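Incidentally, you don’t need a report to see this happen: running any DAX query from DAX Studio that touches only those columns has the same effect. Here’s a rough, hand-written approximation of the kind of query a table visual like this generates (not the exact query Power BI sends), which is enough to page the Country and Sales columns in:

// Groups by Country and sums Sales, so only those two columns
// need to be paged into memory
EVALUATE
SUMMARIZECOLUMNS (
    'Sales'[Country],
    "Sum of Sales", SUM ( 'Sales'[Sales] )
)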
Rerunning the DMV query above now returns the following:

You can now see that the Country and Sales columns are resident in memory, have a fairly “hot” temperature, and you can see the date and time they were last accessed.
Doing nothing else apart from waiting about five minutes and then rerunning the same DMV query returns the following:

You can see that Country and Sales are still in memory but their temperature has reduced.
Adding the Product column to the table visual like so:

…and then rerunning the DMV query now gives the following results:

As you might expect, the single column segment for the Product column has now been paged into memory, but it still has a lower temperature than the column segments for Country and Sales.
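Again, a hand-written DAX query that references Product would have done the same thing; for example, extending the earlier sketch to group by Product as well:

// Same as the earlier query but now grouping by Product too,
// which forces Product's column segment into memory
EVALUATE
SUMMARIZECOLUMNS (
    'Sales'[Country],
    'Sales'[Product],
    "Sum of Sales", SUM ( 'Sales'[Sales] )
)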
As well as looking at column segments, it’s also possible to do the same thing for column dictionaries by running the following DMV query:
SELECT
    COLUMN_ID, DICTIONARY_SIZE, DICTIONARY_ISPAGEABLE,
    DICTIONARY_ISRESIDENT, DICTIONARY_TEMPERATURE, DICTIONARY_LAST_ACCESSED
FROM SYSTEMRESTRICTSCHEMA (
    $System.DISCOVER_STORAGE_TABLE_COLUMNS,
    [DATABASE_NAME] = 'Sales Custom Dataset'
)
Here’s an example of the output:

Finally, after leaving the dataset for two days and not running any other queries or opening any reports, rerunning both DMV queries shows that everything has been paged out of memory again:

OK, so this is all very interesting for nerds like me who like to see how things work behind the scenes. As I said, though, it’s something that has been happening for a while now for Import mode datasets using the large dataset storage format; if you’re thinking that it would be cool to build a report using these DMVs and use the Temperature column to see which columns are the most frequently used, Gilbert Quevauvilliers already did that here.
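If you fancy trying something similar yourself, here’s a minimal sketch of how you might read this DMV from Power Query using the AnalysisServices.Database connector. The workspace address below is a placeholder (replace it with your own workspace’s XMLA Endpoint address), and I’m assuming here that a DMV query can be passed in via the Query option; this isn’t necessarily how Gilbert’s report works:

let
    // Placeholder XMLA Endpoint address - replace with your own workspace's
    WorkspaceAddress = "powerbi://api.powerbi.com/v1.0/myorg/My Workspace",
    // The same column segment DMV query used earlier in this post
    DmvQuery =
        "SELECT COLUMN_ID, SEGMENT_NUMBER, ISPAGEABLE, ISRESIDENT, TEMPERATURE, LAST_ACCESSED " &
        "FROM SYSTEMRESTRICTSCHEMA($System.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS, " &
        "[DATABASE_NAME] = 'Sales Custom Dataset')",
    Source = AnalysisServices.Database(WorkspaceAddress, "Sales Custom Dataset", [Query = DmvQuery])
in
    Source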
There is one big difference between Import mode and Direct Lake mode to point out, though. When you do a full refresh of an Import mode dataset the whole dataset has to fit in memory, so you are at the mercy of the limits on the amount of memory a dataset can use in either Shared capacity or the Premium capacity you are using. These same limits exist for Direct Lake datasets, but since refresh for a Direct Lake dataset is not the same thing as refresh for an Import dataset, at no point during a refresh does the whole dataset need to be in memory. This means the memory limits only apply to how much of the dataset can be paged into memory at any one time, so in some cases you’ll be able to work with much larger datasets than are possible in Import mode, so long as your queries only need to read a small part of the data at a time.

I say “in some cases” because it’s complicated: there are various other, as yet undocumented, rules governing whether a Direct Lake dataset can be queried as such or whether queries fall back to DirectQuery mode, and some of these rules do relate to the data volumes involved. The point is that, as Power BI developers, the way we think about dataset memory usage will have to change with Direct Lake mode.