What Happens When Power BI Direct Lake Semantic Models Hit Guardrails?

Direct Lake mode in Power BI allows you to build semantic models on very large volumes of data, but because it still relies on an in-memory database engine there are limits on how much data it can work with. As a result it has rules – called guardrails – that it uses to check whether you are trying to build a semantic model that is too large. But what happens when you hit those guardrails? This week one of my colleagues, Gaurav Agarwal, showed me the results of some tests that he did, which I thought I would share here.

Before I do that though, a bit more detail about what these guardrails are. They are documented in the table here; they vary by Fabric capacity SKU size and there are four of them, which are limits on:

  • The number of Parquet files per Delta table
  • The number of rowgroups per Delta table
  • The number of rows per table
  • The total size of the data used by the semantic model on disk

There is also a limit on the amount of memory that can be used by a semantic model, something I have blogged about extensively, but technically that’s not a guardrail.

Remember also that there are two types of Direct Lake mode (documented here): the original Direct Lake mode, Direct Lake on SQL endpoints, which I will refer to as DL/SQL and which can fall back to DirectQuery mode; and the newer Direct Lake on OneLake, which I will refer to as DL/OL and which cannot fall back to DirectQuery.

For his tests, Gaurav built a Fabric Warehouse containing a single table. He then added more and more rows to this table to see how a Direct Lake semantic model built on this Warehouse behaved. Here’s what he found.

If you build a DL/SQL model that exceeds one of the guardrails for the capacity SKU that you are using then, when you refresh that model, the refresh will succeed and you will see the following warning message in the model Refresh History:

We noticed that the source Delta tables exceed the resource limits of the Premium or Fabric capacity requiring queries to fallback to DirectQuery mode. Ensure that the Delta tables do not exceed the capacity's guardrails for best query perf.

This means that even though the refresh has succeeded, because the model has exceeded one of the guardrails it will always fall back to DirectQuery mode – with all the associated performance implications.

If your DL/SQL model exceeds one of the guardrails for the largest Fabric SKU, an F2048, then refresh will fail but you will still be able to query the model and the model will again fall back to DirectQuery mode. For his tests, Gaurav loaded 52 billion rows into a table; the guardrail for the maximum number of rows in a table for an F2048 is 24 billion rows. The top-level message you get when you refresh in this case is simply:

An error occurred while processing the semantic model.

Although if you look at the details you’ll see a more helpful message:

We cannot refresh the semantic model because of a Delta table issue that causes framing to fail. The source Delta table 'billionrows' has too many parquet files, which exceeds the maximum guardrails. Please optimize the Delta table. See https://go.microsoft.com/fwlink/?linkid=2316800 for guardrail details.

The DAX TableTraits() function, which another colleague, Sandeep Pawar, blogged about here, can also tell you the reason why a DL/SQL semantic model is falling back to DirectQuery mode. Running the following DAX query on the 52 billion row model:

EVALUATE TABLETRAITS()

…returned the following results:

This shows that the table called billionrows actually exceeds the guardrails for the number of files, the number of rowgroups and the number of rows.
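
Incidentally, you don’t need DAX Studio to run this query: you can also run it from a Fabric notebook with Semantic Link. A minimal sketch, with placeholder model and workspace names:

import sempy.fabric as fabric

# run the TableTraits query to see why the model is falling back to DirectQuery
df = fabric.evaluate_dax(
    dataset = "Insert Semantic Model Name Here",
    workspace = "Insert Workspace Name Here",
    dax_string = "EVALUATE TABLETRAITS()"
)
display(df)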

What about DL/OL models though? Since they cannot fall back to DirectQuery mode, when you try to build or refresh a DL/OL semantic model that exceeds a guardrail you’ll get an error and you won’t be able to query your semantic model at all. For example here’s what I saw in Power BI Desktop when I tried to use the 52 billion row table in a DL/OL model:

Something went wrong connecting to this item. You can open the item in your browser to see if there is an issue or try connecting again.
We cannot refresh the semantic model because of a Delta table issue that causes framing to fail. The source Delta table 'billionrows' has too many parquet files, which exceeds the maximum guardrails. Please optimize the Delta table. 

All of this behaviour makes sense if you think about it, even though I wouldn’t have known exactly how things worked until I saw it for myself. Some behaviour may change in the future to make it more intuitive; if that happens I will update this post.

[Thanks to Gaurav for showing me all this – check out his podcast and the India Fabric Analytics and AI user group that he helps run on LinkedIn. Thanks also to Akshai Mirchandani and Phil Seamark for their help]

Performance Testing Power BI Direct Lake Models Revisited: Ensuring Worst-Case Performance

Two years ago I wrote a detailed post on how to do performance testing for Direct Lake semantic models. In that post I talked about how important it is to run worst-case scenario tests to see how your model performs when there is no model data present in memory, and how it was possible to clear all the data held in memory by doing a full refresh of the semantic model. Recently, however, a long-awaited performance improvement for Direct Lake has been released which means a full semantic model refresh may no longer page all data out of memory – which is great, but which also makes running performance tests a bit more complicated.

First of all, what is this new improvement? It’s called Incremental Framing and you can read about it in the docs here. Basically, instead of clearing all data out of memory when you do a full refresh of a Direct Lake model, the model now checks each Delta table it uses to see whether the data in it has actually changed. If it hasn’t changed then there’s no need to clear any data from that table out of memory. Since there’s a performance overhead to loading data into memory when a query runs, this means that you’re less likely to encounter this overhead, and queries (especially on models where the data in some tables changes frequently) will be faster overall. I strongly recommend you read the entire docs page carefully though, not only because it contains a lot of other useful information, but also because you might be loading data into your lakehouses in a way that prevents this optimisation from working.

Let me show you an example of this by revisiting a demo from a session I’ve done at several user groups and conferences on Power BI model memory usage (there are several recordings of it available, such as this one). Using a Direct Lake semantic model consisting of a single large table with 20 columns containing random numbers, if I use DAX Studio’s Model Metrics feature when there is no data held in memory and with the Direct Lake Behaviour setting in DAX Studio’s Options dialog set to ResidentOnly (to stop Model Metrics from loading data from all columns into memory when it runs):

…then when I run Model Metrics the size of each column in the semantic model is negligible and the Temperature and Last Accessed values for all model columns are blank:

Then, if I run a query that asks for data from just one column from this table (in this case the column called “1”) like this:

EVALUATE ROW("Test", DISTINCTCOUNT('SourceData'[1]))

…and then rerun Model Metrics, the size in memory for that column changes, because of course it has been loaded into memory in order to run the query:

Zooming in on the Model Metrics table columns from the previous screenshot that show the size in memory:

And here are the Temperature and Last Accessed columns from the same screenshot which are no longer blank:

Since the query had to bring the column into memory before it could run, the DAX query took around 5.3 seconds. Running the same query after that, even after using the Clear Cache button in DAX Studio, took only about 0.8 seconds because the data needed for the query was already resident in memory.

OK, so far nothing has changed in terms of behaviour. However if you do a full refresh from the Power BI UI without making any changes to the underlying Delta tables:

And then rerun the Model Metrics, nothing changes and the data is still in memory! As a result the DAX query above still only takes about 0.8 seconds.

So how do you get that worst-case performance again? As mentioned in the docs here, you now need to do a refresh of type clearValues followed by a full refresh. You can’t do a refresh of type clearValues in the Power BI UI though, so the easiest way to do it is to use a Fabric notebook and Semantic Link Labs. Here’s how. First install Semantic Link Labs:

%pip install semantic-link-labs

Then use the following code in a notebook cell to do a refresh of type clearValues followed by a full refresh:

import sempy_labs as labs
WorkspaceName = "Insert Workspace Name Here"
SemanticModelName = "Insert Semantic Model Name Here"
# run a refresh of type clearValues first
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="clearValues")
# then a refresh of type full
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full")

After doing this on my model, Model Metrics shows that the column called “1” that was previously in memory is no longer resident:

…and the query above once again takes 5 seconds to run.

So, as you can see, if you’re doing performance testing of a Direct Lake model you now need to make sure you do a refresh of type clearValues and a full refresh of your model before each test to ensure no data is resident in memory and get worst-case performance readings, in addition to testing performance on a cold cache and a warm cache.
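
If you’re doing this repeatedly, it’s worth wrapping the two refreshes up in a helper function in your notebook. Here’s a minimal sketch using the same Semantic Link Labs function as above, with placeholder names:

import sempy_labs as labs

def reset_to_worst_case(dataset, workspace):
    # a clearValues refresh pages all data out of memory
    labs.refresh_semantic_model(dataset=dataset, workspace=workspace, refresh_type="clearValues")
    # a full refresh then reframes the model so it can be queried again
    labs.refresh_semantic_model(dataset=dataset, workspace=workspace, refresh_type="full")

# call this before each test run
reset_to_worst_case("Insert Semantic Model Name Here", "Insert Workspace Name Here")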

Changing The Source Lakehouse Of Power BI Direct Lake Models In Deployment Pipelines

If you’re using deployment pipelines with Direct Lake semantic models in Power BI you’ll have found that when you deploy your model from one stage to another, by default the model still points to the Lakehouse it was originally bound to. So, for example, if you deploy your model from your Development stage to your Test stage, the model in the Test stage still points to the Lakehouse in the Development stage. The good news is that you can use the deployment rules feature of deployment pipelines to make sure the model in the Test stage points to a Lakehouse in the Test stage, and in this post I’ll show you how.

To illustrate how to do this I created a workspace to represent a dev environment with a Lakehouse containing one table, called MyTable, containing the following data:

I then created a second workspace to represent a test environment with a Lakehouse containing a table with the same name and the same schema but with different data:

Then, back in the dev workspace I created a custom semantic model pointing to the MyTable table and built a report on it to show the data:

Here’s what the dev workspace looked like:

I then created a deployment pipeline, assigned the two workspaces to the Development and Test stages of it, and configured it so that the semantic model and report would be deployed from Development to Test:

I then did a deployment and, as mentioned, although the report and the semantic model were moved to the Test stage workspace, the model still pointed to the Lakehouse in the Development stage workspace so the report showed the same data as before:

To fix this I created a deployment rule using the “Data source rules” option. This is where things get tricky. When you create the rule you can see that the custom model on the From side has a Database and a Server property to know which Lakehouse it is pointing to:

This means that on the To side you need to choose the “Other” option in the dropdown box and enter values for the Database and Server:

What do these properties mean?

  • The Database property is the ID of the SQL Endpoint of the Lakehouse the model is pointing to and you can find it by opening the SQL Endpoint UI of the Lakehouse in the browser and copying it from the part of the URL that immediately follows “lakehouses”:
  • The Server property is the SQL connection string of the SQL Endpoint, which can be found by clicking the Gear icon on the menu in the SQL Endpoint UI to open the Settings pane:
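
If you’d rather look these two values up programmatically than copy them out of the browser, the Fabric REST API’s Get Lakehouse endpoint returns both. Here’s a sketch using Semantic Link’s REST client – the IDs are placeholders, and I’m assuming the documented response shape with a sqlEndpointProperties object:

import sempy.fabric as fabric

client = fabric.FabricRestClient()
workspace_id = "Insert Workspace ID Here"
lakehouse_id = "Insert Lakehouse ID Here"

response = client.get(f"v1/workspaces/{workspace_id}/lakehouses/{lakehouse_id}")
sql_endpoint = response.json()["properties"]["sqlEndpointProperties"]

# the Database property for the deployment rule is the SQL Endpoint's ID
print("Database:", sql_endpoint["id"])
# the Server property is the SQL Endpoint's connection string
print("Server:", sql_endpoint["connectionString"])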

I changed the settings in the data source rule to the values from the Lakehouse in the Test workspace and then redeployed. After doing that, the report in my Test workspace showed the data from the Lakehouse in the Test workspace (note: I had to manually refresh the report first to flush out stale data from the cache):

Finding The Size Of All Of The Columns Touched By A DAX Query

I had meant to follow up my recent post on how to find the columns touched by a DAX query by writing one on how to use this technique to find the size of those columns in memory, so you can find the total size of the columns that need to be paged into memory when a DAX query runs on a Direct Lake semantic model. Before I could do that, though, my colleague Michael Kovalsky messaged me to say that not only had he taken the query from that first post and incorporated it into Semantic Link Labs, he’d done the work to get column sizes too. All that’s left for me to do, then, is give you some simple examples of how to use it.

To use Semantic Link Labs you just need to create a new Fabric notebook and install the library:

%pip install semantic-link-labs

After that you can use sempy_labs.get_dax_query_dependencies to get the columns touched by any DAX query, for example:

import sempy_labs as labs
labs.get_dax_query_dependencies(
        dataset = 'InsertSemanticModelName',
        workspace = 'InsertWorkspaceName',
        dax_string = "InsertDAXQuery",
    )

This returns a dataframe with one row for each column touched by the query, plus various statistics about the size of each column in memory.

If you’re working with a Direct Lake semantic model, though, in order to get the correct sizes of each column in memory the query itself will need to have been run beforehand; you can ensure that this happens by setting the optional parameter put_in_memory to True:

import sempy_labs as labs
labs.get_dax_query_dependencies(
        dataset = 'InsertSemanticModelName',
        workspace = 'InsertWorkspaceName',
        dax_string = "InsertDAXQuery",
        put_in_memory = True
    )

Last of all, if you don’t want a dataframe but just want a single number representing the total memory needed by all columns touched by a query, you can use sempy_labs.get_dax_query_memory_size, for example like this:

import sempy_labs as labs
labs.get_dax_query_memory_size(
        dataset = 'InsertSemanticModelName',
        workspace = 'InsertWorkspaceName',
        dax_string = "InsertDAXQuery"
    )

Yet more evidence that, for any Power BI user, Semantic Link and Semantic Link Labs are the best reasons for you to flip the switch to enable Fabric. To find out more about what they are capable of, check out this user group presentation.

Role-Playing Dimensions In Fabric Direct Lake Semantic Models

Did you know you can add the same physical table in OneLake multiple times to the same Direct Lake semantic model?

Let’s say you have two tables in a Fabric Lakehouse. One is a fact table called Sales:

…and the other is a dimension table called Date:

Note that the Sales fact table has two date columns, OrderDate and ShipDate.

If you create a Direct Lake semantic model using the Web Editor and add these two tables, you could rename the Date table to Order Date and build a relationship between it and the OrderDate column on the Sales table:

What about analysing by Ship Date though? You could create a physical copy of the Date table in your Lakehouse and add that to the model, but there’s another option.

Connect to the model using a tool like Tabular Editor, duplicate the Order Date table and rename the new table Ship Date:

You then have two tables in your semantic model connected to the same physical table in your Lakehouse, and you can create a relationship between this new table and the Sales table:

…and then use the new dimension in your reports:

This is handy for handling role-playing dimensions, which are most often Date dimensions but may be other types of dimension too. Indeed, I’ve sometimes found the need to add the same fact table to a model more than once. The benefits of having only one physical copy are reduced refresh time, lower storage costs (although for most role-playing dimensions the savings will be negligible) and simpler ETL.
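
If you’d rather script this than click through Tabular Editor, the TOM wrapper in Semantic Link Labs can do the same job from a notebook. Here’s a rough sketch – the table and column names are hypothetical, and I’m assuming the add_table, add_entity_partition, add_data_column and add_relationship helpers behave as documented:

from sempy_labs.tom import connect_semantic_model

with connect_semantic_model(dataset="Insert Semantic Model Name Here",
                            workspace="Insert Workspace Name Here",
                            readonly=False) as tom:
    # add a second model table pointing at the same physical Delta table
    tom.add_table(name="Ship Date")
    tom.add_entity_partition(table_name="Ship Date", entity_name="Date")
    tom.add_data_column(table_name="Ship Date", column_name="Date",
                        source_column="Date", data_type="DateTime")
    # relate the new table to the ShipDate column on the fact table
    tom.add_relationship(from_table="Sales", from_column="ShipDate",
                         to_table="Ship Date", to_column="Date",
                         from_cardinality="Many", to_cardinality="One")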

Which Columns Are Loaded Into Memory When In Power BI Direct Lake Mode?

As you probably know, in Power BI Direct Lake mode column data is only loaded into memory when it is needed by a query. I gave a few examples of this – and how to monitor it using DMVs – in this blog post from last year. But which columns are loaded into memory in which circumstances? I was thinking about this recently and realised I didn’t know for sure, so I decided to do some tests. Some of the results were obvious, some were a surprise.

Test semantic model and methodology

For my tests I loaded the following tables of data into a Fabric Lakehouse:

…and created a Direct Lake custom semantic model that looked like this:

Before every test I refreshed the model to make sure there was no data in memory (be aware that in the future it may be that refreshing a model does not purge all column data from memory). Then, for each test, I ran a single DAX query from DAX Studio and afterwards ran the following query to see whether the dictionaries for each column were resident in memory (again, see my previous post for background):

EVALUATE 
SELECTCOLUMNS(
INFO.STORAGETABLECOLUMNS(),
"TABLE_ID", [TABLE_ID],
"COLUMN_ID", [COLUMN_ID],
"DICTIONARY_ISRESIDENT", [DICTIONARY_ISRESIDENT])

Note that I’m using one of the new DAX Info functions, INFO.STORAGETABLECOLUMNS(), instead of the older DMV syntax I was using last year; I think the DAX Info functions are a lot more convenient to use.

[I also looked at whether column segments were paged into memory using the INFO.STORAGETABLECOLUMNSEGMENTS() function, but found that its results were consistent with INFO.STORAGETABLECOLUMNS() so I didn’t include them in this post and assumed that if a column’s dictionary was resident in memory, so were the associated column segments. It’s also worth mentioning that the latest versions of DAX Studio have some great new functionality in the Model Metrics for telling you which columns are resident in memory in a Direct Lake model]

Here’s what the query above returned immediately after a refresh, before any other queries had been run:

The DICTIONARY_ISRESIDENT column tells you whether a column’s dictionary is resident in memory. Each table in a Direct Lake model (as in an Import mode model) has a hidden column called RowNumber, and for a Direct Lake model this column is always resident in memory. As you can see, none of the other columns are resident in memory at this point.
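
By the way, if you want to rerun this residency check repeatedly between tests, you can also run it from a Fabric notebook with Semantic Link – it does nothing you can’t do in DAX Studio, it’s just easier to automate. A minimal sketch, with placeholder names:

import sempy.fabric as fabric

residency_query = """
EVALUATE 
SELECTCOLUMNS(
INFO.STORAGETABLECOLUMNS(),
"TABLE_ID", [TABLE_ID],
"COLUMN_ID", [COLUMN_ID],
"DICTIONARY_ISRESIDENT", [DICTIONARY_ISRESIDENT])
"""

df = fabric.evaluate_dax(
    dataset = "Insert Semantic Model Name Here",
    workspace = "Insert Workspace Name Here",
    dax_string = residency_query
)
# inspect the DICTIONARY_ISRESIDENT column to see what is currently in memory
display(df)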

Counting rows in a table

For my first test I created a measure that counted the rows in the Sales table:

Order Count = COUNTROWS('Sales')

I then ran a query that returned just the result of this measure:

EVALUATE
CALCULATETABLE(
    ROW(
    "Order Count", [Order Count]
    )
)

I then ran my query to see what had been loaded into memory, and…

…nothing had changed! No new column data had been loaded into memory at all. I assume this is because Power BI can resolve this query using the RowNumber column from the Sales table.

I then added the Order_Status column from the Sales table to the query:

EVALUATE
SUMMARIZECOLUMNS(
    Sales[Order_Status],
    "Order Count", [Order Count]
)
ORDER BY 
    Sales[Order_Status] ASC

After this query, the Order_Status column was resident in memory as you might expect, but no other column was:

Relationships

Instead of using the Order_Status column, I then looked at the impact of using a column from a different table. I created a query that showed Order Count by Customer Name:

EVALUATE
SUMMARIZECOLUMNS(
    Customer[Customer_Name],
    "Order Count", [Order Count]
)
ORDER BY 
    Customer[Customer_Name] ASC

After this, the Customer_Name column was in memory along with the two Customer_ID columns used in the relationship between the Customer and Sales tables:

Any time your query references columns in different tables, the columns used in the relationships between those tables must also be resident in memory.

Measures

I then defined another measure:

Sales Amount = SUM(Sales[Sales_Amount])

And ran the following query:

EVALUATE
SUMMARIZECOLUMNS(
    Sales[Order_Status],
    "Sales Amount", [Sales Amount]
)
ORDER BY 
    Sales[Order_Status] ASC

After this, the Order_Status and Sales_Amount columns were resident in memory:

No surprises here: as you would expect, if a column is referenced by a measure then it needs to be resident in memory.

Measures that use relationships also work as expected. I created the following measure:

Order Count Citrus = 
CALCULATE([Order Count], 'Product'[Category]="Citrus")

…and the following query:

EVALUATE
SUMMARIZECOLUMNS(
    Customer[Customer_Name],
    "Order Count Citrus", [Order Count Citrus]
)
ORDER BY 
    Customer[Customer_Name] ASC

After this query, all the columns used in the measure and all the relationships needed by the query and the measure were resident in memory:

I was curious to know what the effect of efficient versus inefficient DAX in measures might be on which columns were loaded into memory, though. I created the following two measures to see the impact of the antipattern of filtering on an entire table in the second parameter of CALCULATE():

Order Count Status X V1 = 
CALCULATE([Order Count], 'Sales'[Order_Status]="X")

Order Count Status X V2 = 
CALCULATE([Order Count], FILTER('Sales', 'Sales'[Order_Status]="X"))

My first query used the first of these measures, the efficient version:

EVALUATE
SUMMARIZECOLUMNS(
    Customer[Customer_Name],
    "Order Count Status X V1", [Order Count Status X V1]
)
ORDER BY 
    Customer[Customer_Name] ASC

After this query the Customer_Name column, the two Customer_ID columns and the Order_Status column were resident in memory:

But what about the second, inefficient version of the measure?

EVALUATE
SUMMARIZECOLUMNS(
    Customer[Customer_Name],
    "Order Count Status X V2", [Order Count Status X V2]
)
ORDER BY 
    Customer[Customer_Name] ASC

The same columns were resident in memory:

I don’t think this means the second version of the measure is any less inefficient in Direct Lake mode than it is in Import mode – just that the two versions of the measure need the same columns to be resident in memory in order to run.

Conclusion

Knowing which columns your query needs to have resident in memory is important for two reasons in Direct Lake mode: it helps you understand query performance, because loading column data into memory takes time (Teo Lachev published some test results in a blog post a few days ago which show this clearly); it also helps you understand model memory usage and where you stand relative to the memory limits for the capacity SKU you’re using. As these results show, it’s better to test to see which columns your Power BI reports need to have resident in memory than to rely on guesswork.

It is also possible to use the DISCOVER_CALC_DEPENDENCIES DMV (or the INFO.CALCDEPENDENCY function) to see which tables, columns, measures and relationships are referenced by a DAX query as I described here; I’ve been playing around with this for Direct Lake too, and will report my findings in a future blog post.

Power BI Semantic Model Memory Errors, Part 1: Model Size

You probably know that semantic models in Power BI are limited in the amount of memory they can use. This is true of all types of semantic model – Import, Direct Lake and DirectQuery – but it’s not something you usually need to worry about for DirectQuery mode. The amount of memory they can use depends on whether you’re using Shared (aka Pro) or a Premium/Fabric capacity and, if you’re using a capacity, how large that capacity is. In Shared/Pro the maximum amount of memory that a semantic model can use is 1GB; if you are using a capacity then the amount of memory available for models in each SKU is documented in the table here in the Max Memory column:

What counts as “memory usage” though? More importantly, how can you breach this limit and what do all of the different memory-related error messages that you might see mean? In this series I will try to answer these questions, and in this post I will look at one particular error you see when your model needs to use more memory than it is allowed to.

First of all it’s important to understand that the amount of memory used by a semantic model is not the same as the amount of data “in” the model. The diagram below shows how model memory usage can be broken down. The data in the columns and tables of your model, along with supporting objects like relationships (represented by the blue box in the diagram below) makes up just one part of the overall model memory usage. In addition, more memory is needed to store data associated with row-level security, user sessions, caches and so on (represented by the orange box in the diagram below).

Both Import mode and Direct Lake models can page data in and out of memory as required, so the whole model may not be in memory at any given time. However, in order for a query to run, the data it needs must be in memory and cannot be paged out until the query has finished with it. Therefore out of all the memory consumed by a semantic model, at any given time, some of that memory is “evictable” because it isn’t in use while some of it is “non-evictable” because it is being used. Evictable memory may be paged out of memory for a variety of reasons, for example because the model is nearing its allowed memory limit.

Queries that are running on the model (the purple boxes in the diagram above) also consume memory. Each query has a limit on the amount of memory it can use – I mentioned the Query Memory Limit in this post but I will revisit it later on in this series – but the memory used by queries does not contribute directly to the overall memory use of a semantic model. However a query that is running will force parts of the model to be in memory for a certain amount of time, and this memory will be non-evictable while in use.

In summary then, the total amount of memory used by a semantic model is made up of two groups:

  1. The data in the tables in your model (the blue box above)
  2. Supporting data for RLS security roles, sessions and caches (the orange box above)

When the sum of these two groups exceeds the total amount of memory allowed for your model, and no data can be evicted from memory to reduce this sum, then you’ll get an error.
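
There’s no single value you can query for this total, but you can get a rough idea of the size of the blue box – the data currently resident in memory – from a notebook using the DAX INFO functions mentioned elsewhere on this blog. A sketch, with placeholder names, assuming INFO.STORAGETABLECOLUMNS and INFO.STORAGETABLECOLUMNSEGMENTS expose the same DICTIONARY_SIZE and USED_SIZE columns as the equivalent DMVs:

import sempy.fabric as fabric

size_query = """
EVALUATE ROW(
"DictionarySizeBytes", SUMX(INFO.STORAGETABLECOLUMNS(), [DICTIONARY_SIZE]),
"SegmentSizeBytes", SUMX(INFO.STORAGETABLECOLUMNSEGMENTS(), [USED_SIZE])
)
"""

# sums the dictionary and column segment sizes for everything currently resident
df = fabric.evaluate_dax(
    dataset = "Insert Semantic Model Name Here",
    workspace = "Insert Workspace Name Here",
    dax_string = size_query
)
display(df)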

To illustrate this I created a new F2 capacity, which has a 3GB limit on the amount of memory used by a semantic model, loaded a table (called SourceData) with 3.5 million rows of random numbers stored as text into a Lakehouse, then created a new custom Direct Lake semantic model on it. I set the Direct Lake Behavior property on the model to “Direct Lake only” to prevent fallback to DirectQuery mode.

After creating the model I used DAX Studio’s Model Metrics feature with the “Read statistics from data” option turned off to find the amount of data stored in memory (ie the blue box value).

Unsurprisingly, at this stage, the size of the model was very small: only 8KB.

I then turned the “Read statistics from data” option on, knowing that this would force data to be paged into memory. This showed the total potential size of the model to be 4.25GB:

I was initially confused by this because it is already well over the 3GB limit, but it was pointed out to me that what is probably happening is that DAX Studio runs a number of DMV queries to get the data needed to calculate this value, and when this happens different parts of the model are paged in and out of memory. It was certainly very slow for DAX Studio to calculate the Model Metrics when I did this, which fits with the paging in/out theory.

Finally, I ran a simple DAX query to get the top 10 rows from the SourceData table:

EVALUATE TOPN(10, SourceData)

This query ran for about ten seconds and then failed with the following error message:

Resource Governing: We cannot complete the requested operation because there isn’t enough memory (consumed memory 4620 MB, memory limit 3072 MB). Either reduce the size of your dataset, such as by limiting the amount of in-memory data, or host the dataset on a Fabric or Premium capacity with a sufficient memory size. See https://go.microsoft.com/fwlink/?linkid=2159753 to learn more.

[The error code associated with this message is 0xC13E0006 or -1052901370]

This is the error that you get when your model needs to use more memory than it is allowed to use for the capacity SKU it is running on. The query references every column from the only table in the model, which means the whole table – which is the whole model – would have to be paged into memory for the query to run, but the whole model requires more memory than is available on an F2 capacity.

If you aren’t getting this exact error message then something slightly different might be happening. In future posts in this series I will look at some of these other errors including the query memory limit and the command memory limit.

[Thanks to Marius Dumitru and Akshai Mirchandani for the information in this post]


Exploring Power BI Run-Length Encoding With DMVs

Recently I was involved in an interesting discussion on X (formerly Twitter) about how partitioning a table in a Power BI semantic model can affect compression and therefore its size and query performance. You can read the thread here. It got me thinking: is there a way to get more detail on how well compression, and in particular run-length encoding (RLE from hereon), is working for a column when you’re using Import mode or Direct Lake? After a bit of research I found out there is, so let’s see some examples that illustrate what I learned.

First of all, consider the following M query that returns a table with one numeric column called MyNumbers:

let
    Source = {1..DistinctValues},
    Repeat = List.Transform(Source, each List.Repeat({_}, TotalRows/DistinctValues)),
    Combine = List.Combine(Repeat),
    #"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type"

It references two M parameters: TotalRows, which determines the total number of rows in the table, and DistinctValues, which determines the number of distinct numeric values in the MyNumbers column. With TotalRows=9 and DistinctValues=3, it returns the following output:

Note that in this case it returns 3 rows with the value 1 repeated, 3 rows with the value 2 repeated and 3 rows with the value 3 repeated; three sets of repeated values in all. It’s fair to assume that repeated sets of values like this are a good candidate for RLE.
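
As an aside, if you’ve not come across RLE before, here’s a toy illustration in Python of the basic idea – this has nothing to do with how the Vertipaq engine actually implements it, it just shows why long runs of repeated values compress well:

from itertools import groupby

def rle(values):
    # collapse consecutive repeated values into (value, run length) pairs
    return [(v, len(list(g))) for v, g in groupby(values)]

print(rle([1, 1, 1, 2, 2, 2, 3, 3, 3]))
# [(1, 3), (2, 3), (3, 3)] - three long runs, so RLE compresses well
print(rle([3, 1, 2, 3, 1, 2, 3, 1, 2]))
# nine runs of length 1 - no repeated sequences, so RLE gains nothing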

I created a semantic model in Import mode containing only this table and published it to the Power BI Service. Initially TotalRows was set to 1,000,000 and DistinctValues was set to 100 – so the table consisted of just 100 sets of 10,000 repeated values. I chose 1,000,000 rows because that’s the size of a segment in the Power BI Service with the “small semantic model format” setting and any compression that takes place always takes place within a segment.

When the Analysis Services engine inside Power BI compresses data it looks for sequences of repeated values to see if RLE can be used. If it finds them, these sequences result in “pure” RLE runs; where it doesn’t find them you get “impure” RLE runs and the values are stored using bitpack compression instead. Pure runs are generally a good thing, impure runs generally a bad thing. You can see how many pure and impure runs there are using the TMSCHEMA_COLUMN_STORAGES DMV, for example with the following DMV query:

select 
[Name], Statistics_DistinctStates, Statistics_RowCount, 
Statistics_RLERuns, Statistics_OthersRLERuns 
from $SYSTEM.TMSCHEMA_COLUMN_STORAGES

Running this query in DAX Studio on my published semantic model returned the following table:

[You can ignore all the rows except the one for the MyNumbers column in this table]

The Statistics_RLERuns column shows the number of pure RLE runs; the Statistics_OthersRLERuns column shows the number of impure RLE runs. In this case you can see, for the MyNumbers column, there were 100 pure RLE runs and no impure runs, so as expected RLE is working well.

Here’s what Vertipaq Analyzer showed for this table:

Unsurprisingly the size of the MyNumbers column is very small.

Then I changed DistinctValues to 100,000 (keeping TotalRows at 1,000,000), giving me 100,000 sets of 10 values, and refreshed the dataset. Here’s what the DMV query on TMSCHEMA_COLUMN_STORAGES returned:

And here’s what Vertipaq Analyzer showed:

As you can see, the column was a lot larger than before; there were no pure RLE runs and one impure RLE run. In this case the large number of distinct values in the column prevented RLE from taking place and this had a negative impact on the size of the column.

These are two extreme cases. What about a scenario that’s somewhere in between? I modified my M query as follows:

let  
    RepeatedNumbers = 
    let
    Source = {1..DistinctValues},
    Repeat = List.Transform(Source, each List.Repeat({_}, ((TotalRows/2)/DistinctValues))),
    Combine = List.Combine(Repeat),
    #"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type",

    RandomNumbers = 
    let
    Source = {1..TotalRows/2},
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "MyNumbers", each Number.Round(Number.RandomBetween(TotalRows+1, TotalRows*2))),
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"MyNumbers"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type",

    Output = Table.Combine({RepeatedNumbers, RandomNumbers})
in
    Output

What this version of the code does is return a table where the first 50% of the rows are repeated numbers and the second 50% are random numbers. With TotalRows set to 12 and DistinctValues set to 2 it produces the following output:

With this version published to the Power BI Service I set TotalRows to 1,000,000 again and set DistinctValues to 2000, resulting in a table with 2000 sets of 250 repeating values followed by 500,000 random values. Here’s what the DMV query against TMSCHEMA_COLUMN_STORAGES returned:

As you can see there are now 2000 pure runs (I assume for the first 50% of rows with repeated values) and 1 impure run (I assume for the second 50% of rows with random values).

Here’s the output of Vertipaq Analyzer:

The column is now almost as large as in the second scenario above.

You can get a bit more detail about what’s happening in the impure runs with the DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS DMV. Running the following query against the latest version of the table:

select 
column_ID, partition_name, segment_number,
records_count, bits_count, used_size
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS

…returns the following values:

To get a rough idea of the number of rows in the impure runs you can use the following formula:

(used_size * 8)/bits_count

In this case for the MyNumbers column (1349720 * 8)/21 = 514,179, which makes sense since my code returns 500,000 rows of random numbers. The records_count column in this query returns the total number of rows in the segment, so the higher the number of rows in impure runs relative to the total, the worse the compression you’re getting.
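
If you want to do this arithmetic without copying numbers out of DAX Studio, you can get the same segment statistics from a notebook – here’s a sketch using the INFO.STORAGETABLECOLUMNSEGMENTS DAX function mentioned elsewhere on this blog, assuming it exposes the same columns as the DMV (the estimate is calculated in the DAX itself, guarding against columns with no impure runs where BITS_COUNT is zero):

import sempy.fabric as fabric

segments_query = """
EVALUATE 
SELECTCOLUMNS(
INFO.STORAGETABLECOLUMNSEGMENTS(),
"COLUMN_ID", [COLUMN_ID],
"RECORDS_COUNT", [RECORDS_COUNT],
"BITS_COUNT", [BITS_COUNT],
"USED_SIZE", [USED_SIZE],
"IMPURE_ROWS_ESTIMATE", IF([BITS_COUNT] > 0, ([USED_SIZE] * 8) / [BITS_COUNT]))
"""

df = fabric.evaluate_dax(
    dataset = "Insert Semantic Model Name Here",
    workspace = "Insert Workspace Name Here",
    dax_string = segments_query
)
display(df)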

What practical use is this information? Probably not as much as you might think, interesting as it is. It can tell you how well RLE is working for a column but it doesn’t tell you much about how to optimise it, or whether it is possible to optimise it, or whether optimising it is a good idea – that’s a subject for another blog post.

[Thanks to Marius Dumitru and Akshai Mirchandani for the information in this post]

What Does It Mean To Refresh A Direct Lake Power BI Dataset In Fabric?

If you’ve heard about the new Direct Lake mode for Power BI datasets in Fabric you’ll know that it gives you the query performance of Import mode (well, almost) without the need to actually import any data. Direct Lake datasets can be refreshed though – in fact, they refresh automatically by default – and if you look at the dataset’s Refresh History you’ll see there’s a Direct Lake section which sometimes shows errors:

Also, if you look at a custom dataset’s Settings page (although not yet for a default dataset) you’ll see some properties to control refresh too:

So what does it actually mean to refresh a Direct Lake dataset if it doesn’t involve loading data into the dataset?

The section on Direct Lake refresh in the Fabric docs has the following information:

Invoking a refresh for a Direct Lake dataset is a low cost operation where the dataset analyzes the metadata of the latest version of the Delta Lake table and is updated to reference the latest files in the OneLake.

Let’s see what this means using a simple example. I built a Dataflow Gen2 that loads a single row of data into a table in a Fabric Lakehouse with two columns: one called Sales that always contains the value 1 and one called LoadDate that contains the date and time the dataflow ran:

I ran the dataflow once to load a row of data into a table called MyTable in a Lakehouse:

I then built a custom dataset (because I want to change those refresh options mentioned above) consisting of just this table, and finally a report showing the contents of the table in the dataset:

I then connected SQL Server Profiler to the custom dataset via the XMLA Endpoint and started a trace to capture the Command Begin/End and Progress Report Begin/End events, and refreshed the dataflow (and only the dataflow) to load another row of data into the Lakehouse. Soon after the dataflow refresh finished, the Profiler trace showed a dataset refresh started automatically:

Refreshing the report showed the second row that had just been loaded:

This shows that, with the default settings, a Direct Lake dataset is automatically refreshed when data is loaded into a Lakehouse.

I then went to the Settings pane for the dataset and turned off the “Keep your Direct Lake data up to date” property:

I then ran the dataflow again and this time the Profiler trace showed that no automatic refresh took place; the new row was not shown in the report either. Manually refreshing the dataset from the workspace did result in the new row appearing in the report:

Next, I used a Notebook to delete all the rows from the table:

At this point the report still showed the three rows displayed in the previous screenshot. Finally, I refreshed the dataset one more time and all the data disappeared from the report:

It’s important to stress that the reason that the refresh is needed to show the latest data in the table is not because the data is being loaded into the dataset. It’s because, as the docs say, refresh tells the dataset to look at the latest version of the data in the table – which leads on to the whole topic of time travel in Delta tables in Fabric. Dennes Torres has a nice blog post on this subject here which is a great place to start.

Why would you ever want to refresh a Direct Lake dataset manually? Again, the docs have the answer:

You may want to disable [refresh] if, for example, you need to allow completion of data preparation jobs before exposing any new data to consumers of the dataset. 

So, let’s say you need to load some new data to your table and also delete or update some data that’s already there and this needs to be done as several separate jobs. It’s very likely that you don’t want your Power BI reports to show any of the new data until all of these jobs have been completed, so to do this you will want to disable automatic refresh and do a manual dataset refresh as the last step of your ETL.
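
For example, the last step of your ETL could trigger that manual refresh from a notebook. A minimal sketch using the Semantic Link Labs function shown elsewhere on this blog (the names are placeholders):

import sempy_labs as labs

# after all data preparation jobs have finished, expose the new data to reports
labs.refresh_semantic_model(
    dataset = "Insert Semantic Model Name Here",
    workspace = "Insert Workspace Name Here",
    refresh_type = "full"
)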

Performance Testing Power BI Direct Lake Mode Datasets In Fabric

If you’re excited about Direct Lake mode in Fabric you’re probably going to want to test it with some of your own data, and in particular look at DAX query performance. Before you do so, though, there are a few things to know about performance testing with Direct Lake datasets that are slightly different from what you might be used to with Import mode or DirectQuery datasets.

Dataset “hotness”

In my last post I talked about how, in Direct Lake datasets, Power BI can page individual column segments, dictionaries and join indexes into memory on demand when a DAX query is run and how those artefacts may get paged out later on. It therefore follows that there are four possible states or levels of “hotness” that a dataset can be in when a DAX query is run and that each of these states will have different performance characteristics:

  1. The column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in before the query can run.
  2. Some of the column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in, while some of them are already in memory.
  3. All of the column segments, dictionaries and join indexes needed by the query are already held in memory.
  4. All of the column segments, dictionaries and join indexes needed by the query are already held in memory and, as a result of previous query activity, Vertipaq engine caches useful for the query are also already populated.

State (1) is the “coldest” state and will give the worst possible query performance while state (4) is the “hottest” state and will give the best possible query performance.

When you’re testing the performance of a DAX query on a Direct Lake dataset you should test it on a dataset that is in state (1), state (3) and state (4) so you get a good idea of how much time is taken to page data into memory and how much of a performance improvement Vertipaq engine caching brings.

Ensuring everything is paged out

To test query performance in state (1) you need a way to ensure that all column segments, dictionaries and join indexes are paged out of memory. At the time of writing this post you can ensure this simply by refreshing the dataset. This will change sometime in the next few months though, because paging everything out of memory when you refresh is not ideal behaviour in the real world, so there will be another way to ensure everything is paged out. I’ll update this post when that change happens. You can ensure that all column segments and dictionaries have been paged out by running the two DMV queries mentioned in my previous post: DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS and DISCOVER_STORAGE_TABLE_COLUMNS.

[Update August 2025: the change I was talking about has finally happened and I have written about it here: https://blog.crossjoin.co.uk/2025/08/31/performance-testing-power-bi-direct-lake-models-revisited-ensuring-worst-case-performance/ You now need to do a refresh of type clearValues before doing a full refresh to ensure all data is paged out]

Why you should use DAX Studio for performance testing

I also recommend using DAX Studio to run queries when performance testing, for a number of reasons. First of all, it makes it easy to clear the Vertipaq engine cache before a query is run, with the “Clear Cache” button and the “Clear on Run” option (which automatically clears the cache before each query). Clearing the cache this way not only runs a Clear Cache command, to clear the Vertipaq cache, it also runs a (hidden) DAX query that does not query any data from the dataset but which does trigger the creation of all the measures on the dataset. In most cases this is very fast, but if you have thousands of measures it could take a few seconds (similar to what I show here for report-level measures). If you are not using DAX Studio you can achieve the same result by running a query like:

EVALUATE {1}

DAX Studio also lets you run the same query multiple times using its “Run benchmark” feature (although this only lets you test for states (3) and (4) at the time of writing) and its “Server Timings” feature is invaluable for understanding what’s going on inside the Vertipaq engine when a query runs.

Also make sure you are running the very latest version of DAX Studio (which is 3.0.8 at the time of writing) to make sure it works properly with Fabric.

Performance testing methodology

So, putting this all together, in order to run a single performance test on a Direct Lake dataset to capture performance for states (1), (3) and (4):

  1. Preparation
    • Create a custom Power BI dataset in Fabric (not everything here works with a default dataset at the time of writing)
    • Open your report in Power BI Desktop connected to your published Direct Lake dataset
    • Capture the DAX queries generated by the visuals you want to test using Performance Analyzer by clicking Copy Query (see here for how to do this)
    • Install the very latest version of DAX Studio and open it
    • Connect DAX Studio to your workspace’s XMLA Endpoint (see here for how to do this)
    • Paste your DAX query into DAX Studio
    • Turn on DAX Studio’s Server Timings option
    • Ensure the “Clear on Run” option in the ribbon is turned off
  2. To test performance for state (1):
    • Refresh the dataset so that all data is paged out of memory (per the August 2025 update above, this now means a refresh of type clearValues followed by a full refresh)
    • Run the DAX query
    • Save the output of the Server Timings pane in DAX Studio
  3. To test performance for state (3), immediately after the previous steps:
    • Click the “Clear Cache” button on the ribbon in DAX Studio
    • Run the DAX query
    • Save the output of the Server Timings pane in DAX Studio
  4. To test the performance for state (4), immediately after the previous steps:
    • Run the DAX query again without clicking the “Clear Cache” button
    • Save the output of the Server Timings pane
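
DAX Studio’s Server Timings are the best way to see what’s going on inside the engine, but if you just want repeatable end-to-end timings for the three states you could script the whole sequence from a notebook. A rough sketch – the timings include some client overhead so treat them as indicative, I’m assuming Semantic Link Labs’ clear_cache function issues the same ClearCache command as DAX Studio, and (per the August 2025 update above) a clearValues refresh is now needed before the full refresh:

import time
import sempy.fabric as fabric
import sempy_labs as labs

WorkspaceName = "Insert Workspace Name Here"
SemanticModelName = "Insert Semantic Model Name Here"
TestQuery = "Insert DAX Query Here"

def time_query():
    start = time.time()
    fabric.evaluate_dax(dataset=SemanticModelName, workspace=WorkspaceName, dax_string=TestQuery)
    return time.time() - start

# state (1): page everything out of memory, then clear the cache and run
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="clearValues")
labs.refresh_semantic_model(dataset=SemanticModelName, workspace=WorkspaceName, refresh_type="full")
labs.clear_cache(dataset=SemanticModelName, workspace=WorkspaceName)
print("state (1):", time_query())

# state (3): data is now in memory, but clear the Vertipaq cache first
labs.clear_cache(dataset=SemanticModelName, workspace=WorkspaceName)
print("state (3):", time_query())

# state (4): data in memory and engine caches warm
print("state (4):", time_query())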

Fallback to DirectQuery

Even with a Direct Lake dataset there is no guarantee that your query will be answered in Direct Lake mode: in as-yet not fully documented scenarios (but basically if your data volumes are too large for the capacity you’re using) Power BI will switch to using DirectQuery mode against the Lakehouse’s SQL Endpoint. One of the objectives of your performance testing should be to make sure that this happens as infrequently as possible because DirectQuery mode will perform noticeably worse than Direct Lake mode.

You may notice some DirectQuery events even when the query itself only uses Direct Lake; these are used to get metadata or security information from the Lakehouse and can be ignored. Here’s an example of this:

Load testing

Testing for a single user is important, but don’t forget about testing performance with multiple concurrent users. As I discuss here, realistic load tests are the only way to get a good idea of how your report will actually perform in production and you can load test a Direct Lake dataset in exactly the same way as an Import mode or DirectQuery dataset.

Direct Lake is still in preview!

The last point to make is that, like the rest of Fabric, Direct Lake is still in preview. This not only means that functionality is missing, it also means that performance is not yet as good as it will eventually be. So, by all means test Direct Lake and tell us how fast (or not) it is, but be aware that your test results will be out of date very quickly as the engine evolves.

[Thanks to Krystian Sakowski for the information in this post]