DAX · Performance Tuning · Power BI

IF, SWITCH And The Effect Of DAX Variables On Strict/Eager Evaluation

A lot of DAX performance problems relate to the use of strict and eager evaluation with the IF or SWITCH functions. It’s an incredibly complex, “it depends”, black-box type of topic and naturally Marco and Alberto have a great article on it here. Rather than describe any general rules – which would be pretty much impossible – in this blog post I want to show a specific scenario that illustrates how the use of DAX variables can influence whether strict or eager evaluation takes place.

Consider a Power BI semantic model that has the following three tables:

TableA and TableB are fact tables with numeric measure columns; Choice is a disconnected table containing text values that is intended for use in a slicer, so a report user can select an item in that slicer and that in turn influences what values a measure returns:

Here’s the definition of one such measure:

With Variables =
VAR a =
    SUM ( TableA[A] )
VAR b =
    SUM ( TableB[B] )
RETURN
    IF ( SELECTEDVALUE ( Choice[Choice] ) = "TableA", a, b )

And here’s my test report that includes a slicer and a card visual displaying the output of this measure:

Let’s look at the DAX query generated by the card visual containing the measure when “TableA” is selected in the slicer:

DEFINE
    VAR __DS0FilterTable =
        TREATAS ( { "TableA" }, 'Choice'[Choice] )

EVALUATE
SUMMARIZECOLUMNS (
    __DS0FilterTable,
    "With_Variables", IGNORE ( 'Choice'[With Variables] )
)

…and in particular what DAX Studio’s Server Timings feature shows us:

Even though only TableA is selected in the slicer and the query only returns the sum of the values in the A column of TableA, we can see that the Storage Engine is also querying TableB and getting the sum of the B column. It’s a great example of eager evaluation: both branches of the IF are being evaluated. Is this a bad thing? For this particular report it may be if the Storage Engine query for TableB is expensive.

How can you force strict evaluation to take place? You can force eager evaluation using the IF.EAGER function but there is no equivalent function to force strict evaluation. However you may be able to rewrite the measure to get strict evaluation to take place.
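As an aside, here’s a minimal sketch (not one of the measures tested below) of what forcing eager evaluation with IF.EAGER looks like on the same tables – the opposite of what we want here, but it shows the syntax, since IF.EAGER takes the same arguments as IF:

Force Eager =
IF.EAGER (
    SELECTEDVALUE ( Choice[Choice] ) = "TableA",
    SUM ( TableA[A] ),
    SUM ( TableB[B] )
)

A measure like this queries both TableA and TableB whatever is selected in the slicer.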

The key factor in this case is the use of variables in the measure definition. If you rewrite the measure to not use variables, like so:

No Variables =
IF (
    SELECTEDVALUE ( Choice[Choice] ) = "TableA",
    SUM ( TableA[A] ),
    SUM ( TableB[B] )
)

…then the DAX query for the card visual behaves differently:

DEFINE
    VAR __DS0FilterTable =
        TREATAS ( { "TableA" }, 'Choice'[Choice] )

EVALUATE
SUMMARIZECOLUMNS (
    __DS0FilterTable,
    "No_Variables", IGNORE ( 'Choice'[No Variables] )
)

Notice that there are now only two Storage Engine queries and that TableB is no longer being queried, which will probably result in better performance. This is strict evaluation.

Why does the use of variables result in eager evaluation here? Because it does, and it’s complicated. I need to stress that DAX uses lazy evaluation for variables, which means a variable is not evaluated if it is never used – in the first measure above, though, both variables are referenced, one in each branch of the IF, and with eager evaluation both branches (and therefore both variables) are evaluated. There are certainly other optimisations that may kick in and result in strict evaluation even when variables are used in IF/SWITCH. Indeed, the fact that the two SUM expressions in this example come from different fact tables is extremely important: if they had come from the same fact table then the behaviour would have been different and strict evaluation would have been used. In summary, though, if you are using variables and you want to try to force strict evaluation with IF/SWITCH then it’s worth rewriting your code to remove those variables to see if it makes a difference.
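For example, here’s a hypothetical SWITCH version of the same logic (not one of the measures used in the tests below) with the SUM expressions moved inline into the branches rather than stored in variables – the kind of rewrite that’s worth testing:

No Variables Switch =
SWITCH (
    SELECTEDVALUE ( Choice[Choice] ),
    "TableA", SUM ( TableA[A] ),
    SUM ( TableB[B] )
)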

I also need to stress that these two measures will perform better or worse depending on how and where they are used. Consider these two table visuals that use the values from the Choice table on rows along with the two measures above:

Running the DAX query for the table on the left, which uses the [No Variables] measure and therefore strict evaluation, gives the following in DAX Studio’s Server Timings:

Note that there are two Storage Engine queries for TableB, which is not a good thing.

On the other hand, the table on the right, which uses the [With Variables] measure, gives the following:

Note that there is only one Storage Engine query for TableB now, so in this case the tables have turned and the [With Variables] measure is likely to perform better.

When you use variables in the branches of IF/SWITCH, as in the [With Variables] measure, the variables are evaluated in the filter context where they are defined; if you don’t use variables in this way, as in the [No Variables] measure, the two SUM expressions used in the two branches are evaluated in the context of their branch, which adds a hidden filter corresponding to the branch condition. This has two consequences which may hurt query performance:

  1. During evaluation, the hidden filter may be materialised into a large in-memory table
  2. Fusion cannot happen across common subexpressions in different branches because they have different contexts

In contrast the use of variables means the expressions used in them are evaluated in a context without the hidden filter corresponding to the branch, which can encourage fusion and discourage the materialisation of large in-memory tables.

[Thanks to Marius Dumitru and Jeffrey Wang for much of the information in this post]

Excel · Performance Tuning · Power BI · Power Query

Excel Workbook Layout And The Performance Of Reading Data With Power Query In Power BI

Excel workbooks are one of the slowest data sources you can use with Power Query in Excel or Power BI. Reading small amounts of data from small workbooks is usually fast; reading large amounts of data from large workbooks can be very slow. But what about reading small amounts of data from large Excel workbooks? I did some tests and it turns out that performance can vary a lot depending on where your data is in the workbook and how that workbook is structured.

[Note: in this post I’ll be looking at .xlsx files, rather than other Excel formats like .xls and .xlsb; Excel files stored on a local disk and accessed via the File.Contents M function rather than stored in SharePoint or any other location; data read from Excel tables rather than direct from the worksheet; and Power Query in Power BI. Other scenarios may perform differently.]

Let’s see a simple example to illustrate what I found out. I created a new Excel workbook with one worksheet in and put a small table of data on it:

At this point the workbook’s size was 11KB. I then opened Power BI Desktop and created a Power Query query that read this table of data from the Excel workbook:

let
    Source = Excel.Workbook(File.Contents("C:\MyWorkbook.xlsx"), null, true),
    Sales_Table = Source{[Item="Sales",Kind="Table"]}[Data],
    #"Changed Type" = Table.TransformColumnTypes(Sales_Table,{{"Product", type text}, {"Sales", Int64.Type}})
in
    #"Changed Type"

Then I used this technique to measure how long it took to load the data from Excel into Power BI. Unsurprisingly, it was extremely fast: 63ms.

Then I added a new worksheet to the workbook, copied the same table onto it, added a large block of random numbers underneath it using the following Excel formula, and then replaced the formula with static values by copying its output and pasting it back over itself as values:

=RANDARRAY(9999,300)

Doing this meant the size of the workbook grew to 43MB. I then created a new Power Query query in Power BI Desktop, identical to the one above except that it connected to the new table. This time the query took 4918ms – almost 5 seconds.

Interestingly, even after the second worksheet with all the extra data on it was added, the first query above (on the worksheet with no other data) was still fast. I also tested refreshing a Power BI dataset that connected to two identical small tables on different worksheets in the same workbook, both with large amounts of other data on them as in the second scenario above, and the performance of both queries was only slightly slower: it was clear that two Power Query queries can read data from the same Excel workbook in parallel.

So: reading a small amount of data from a table on a worksheet with a large amount of other data on it is very slow.

What can we learn from this? Well, if you can influence the structure and layout of the Excel workbooks you are using as a data source – and that’s a big if, because in most cases you can’t – and you only need to read some of the data from them, you should put the tables of data you are using as a source on separate worksheets and not on the same worksheet as any other large ranges or tables of data.

It turns out that when the Power Query Excel connector reads data from an .xlsx file it can deserialise just some of the data in it rather than the whole thing, but what it can and can’t avoid deserialising depends a lot on the structure of the workbook and how the data is stored within the workbook .xlsx file. If you’re quick you can even see how much data is being read in Power BI Desktop in the refresh dialog:

You can also use Process Monitor, as I describe here, to see how much data is being read from any file used by Power Query.

Performance also depends on which application generated the .xlsx file (it’s not just Excel that creates .xlsx files, because other applications export data to .xlsx format without using Excel) or even which version of Excel saved the .xlsx file. This is because the same data can be stored in an .xlsx file in different ways, some of which may be more efficient to read than others. I found this blog post by Brendan Long on the .xlsx file format very clear and it helped me understand how Power Query might go about reading data from an .xlsx file.

[Thanks to Curt Hagenlocher of the Power Query team for answering some of my questions relating to this post]

Direct Lake · Import Mode · Performance Tuning · Power BI

Exploring Power BI Run-Length Encoding With DMVs

Recently I was involved in an interesting discussion on Twitter/X about how partitioning a table in a Power BI semantic model can affect compression and therefore its size and query performance. You can read the thread here. It got me thinking: is there a way to get more detail on how well compression, and in particular run-length encoding (RLE from hereon), is working for a column when you’re using Import mode or Direct Lake? After a bit of research I found out there is, so let’s see some examples that illustrate what I learned.

First of all, consider the following M query that returns a table with one numeric column called MyNumbers:

let
    Source = {1..DistinctValues},
    Repeat = List.Transform(Source, each List.Repeat({_}, TotalRows/DistinctValues)),
    Combine = List.Combine(Repeat),
    #"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type"

It references two M parameters: TotalRows, which determines the total number of rows in the table, and DistinctValues, which determines the number of distinct numeric values in the MyNumbers column. With TotalRows=9 and DistinctValues=3, it returns the following output:

Note that in this case it returns 3 rows with the value 1 repeated, 3 rows with the value 2 repeated and 3 rows with the value 3 repeated; three sets of repeated values in all. It’s fair to assume that repeated sets of values like this are a good candidate for RLE.

I created a semantic model in Import mode containing only this table and published it to the Power BI Service. Initially TotalRows was set to 1,000,000 and DistinctValues was set to 100 – so the table consisted of just 100 sets of 10,000 repeated values. I chose 1,000,000 rows because that’s the size of a segment in the Power BI Service with the “small semantic model format” setting and any compression that takes place always takes place within a segment.

When the Analysis Services engine inside Power BI compresses data it looks for sequences of repeated values to see if RLE can be used. Where it finds them, these sequences are stored as “pure” RLE runs; where it can’t find them, the values are stored in “impure” runs using bitpack compression. Pure runs are generally a good thing, impure runs generally a bad thing. You can see how many pure and impure runs there are using the TMSCHEMA_COLUMN_STORAGES DMV, for example with the following DMV query:

select 
[Name], Statistics_DistinctStates, Statistics_RowCount, 
Statistics_RLERuns, Statistics_OthersRLERuns 
from $SYSTEM.TMSCHEMA_COLUMN_STORAGES

Running this query in DAX Studio on my published semantic model returned the following table:

[You can ignore all the rows except the one for the MyNumbers column in this table]

The Statistics_RLERuns column shows the number of pure RLE runs; the Statistics_OthersRLERuns column shows the number of impure RLE runs. In this case you can see, for the MyNumbers column, there were 100 pure RLE runs and no impure runs, so as expected RLE is working well.

Here’s what Vertipaq Analyzer showed for this table:

Unsurprisingly the size of the MyNumbers column is very small.

Then I changed DistinctValues to 100,000 (keeping TotalRows at 1,000,000), giving me 100,000 sets of 10 values, and refreshed the dataset. Here’s what the DMV query on TMSCHEMA_COLUMN_STORAGES returned:

And here’s what Vertipaq Analyzer showed:

As you can see, the column was a lot larger than before; there were no pure RLE runs and one impure RLE run. In this case the large number of distinct values in the column prevented RLE from taking place and this had a negative impact on the size of the column.

These are two extreme cases. What about a scenario that’s somewhere in between? I modified my M query as follows:

let  
    RepeatedNumbers = 
    let
    Source = {1..DistinctValues},
    Repeat = List.Transform(Source, each List.Repeat({_}, ((TotalRows/2)/DistinctValues))),
    Combine = List.Combine(Repeat),
    #"Converted to Table" = Table.FromList(Combine, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Renamed Columns" = Table.RenameColumns(#"Converted to Table",{{"Column1", "MyNumbers"}}),
    #"Changed Type" = Table.TransformColumnTypes(#"Renamed Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type",

    RandomNumbers = 
    let
    Source = {1..TotalRows/2},
    #"Converted to Table" = Table.FromList(Source, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Added Custom" = Table.AddColumn(#"Converted to Table", "MyNumbers", each Number.Round(Number.RandomBetween(TotalRows+1, TotalRows*2))),
    #"Removed Other Columns" = Table.SelectColumns(#"Added Custom",{"MyNumbers"}),
    #"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns",{{"MyNumbers", Int64.Type}})
in
    #"Changed Type",

    Output = Table.Combine({RepeatedNumbers, RandomNumbers})
in
    Output

What this version of the code does is return a table where the first 50% of the rows are repeated numbers and the second 50% are random numbers. With TotalRows set to 12 and DistinctValues set to 2 it produces the following output:

With this version published to the Power BI Service I set TotalRows to 1,000,000 again and set DistinctValues to 2000, resulting in a table with 2000 sets of 250 repeating values followed by 500,000 random values. Here’s what the DMV query against TMSCHEMA_COLUMN_STORAGES returned:

As you can see there are now 2000 pure runs (I assume for the first 50% of rows with repeated values) and 1 impure run (I assume for the second 50% of rows with random values).

Here’s the output of Vertipaq Analyzer:

The column is now almost as large as in the second scenario above.

You can get a bit more detail about what’s happening in the impure runs with the DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS DMV. Running the following query against the latest version of the table:

select 
column_ID, partition_name, segment_number,
records_count, bits_count, used_size
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS

…returns the following values:

To get a rough idea of the number of rows in the impure runs you can use the following formula:

(used_size * 8)/bits_count

In this case for the MyNumbers column (1349720 * 8)/21 = 514,179, which makes sense since my code returns 500,000 rows of random numbers. The records_count column in this query returns the total number of rows in the segment, so the higher the number of rows in impure runs relative to the total, the worse the compression you’re getting.

What practical use is this information? Probably not as much as you might think, interesting as it is. It can tell you how well RLE is working for a column but it doesn’t tell you much about how to optimise it, whether it is possible to optimise it, or whether optimising it is a good idea – that’s a subject for another blog post.

[Thanks to Marius Dumitru and Akshai Mirchandani for the information in this post]

Dataflows · Fabric · Performance Tuning

Fabric Dataflows Gen2: To Stage Or Not To Stage?

If you read this post that was published on the Fabric blog back in July, you’ll know that each Power Query query in a Fabric Gen2 dataflow has a property that determines whether its output is staged or not – where “staged” means that the output is written to the (soon-to-be hidden) Lakehouse linked to the dataflow, regardless of whether you have set a destination for the query output to be written to. Turning this on or off can have a big impact on your refresh times, making them a lot faster or a lot slower. You can find this property by right-clicking on the query name in the Queries pane:

At the moment this property is on by default for every query although this may change in the future. But should you turn it on for the queries in your Gen2 dataflows? It depends, and you should test to see what gives you the best performance.

Let’s see a simple example. I uploaded a CSV file with about a million rows from my favourite data source, the Land Registry price paid data, to the Files section of a Lakehouse, then created a query that did a group by on one of the columns to find the number of property transactions in each county in England and Wales. The query was set to load its output to a table in a Warehouse.

Here’s the diagram view for this query:

I then made sure that staging was turned off for this query:

This means that the Power Query engine did the group by itself as it read the data from the file.
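To give a rough idea of the transformation involved, here’s a hedged sketch of the kind of group by the query performs – the column names are simplified and a small inline sample table stands in for the real CSV source:

let
    // Hypothetical sample standing in for the raw price paid data read from the CSV file
    PricePaid = #table(
        type table [County = text, Price = number],
        {{"KENT", 250000}, {"KENT", 300000}, {"ESSEX", 200000}}
    ),
    // Count the number of transactions per county
    GroupedByCounty = Table.Group(
        PricePaid,
        {"County"},
        {{"Transactions", each Table.RowCount(_), Int64.Type}}
    )
in
    GroupedByCounty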

Looking at the refresh history for this dataflow:

…showed that the query took between 18 and 24 seconds to run. Clicking on an individual refresh to see the details:

…showed a single activity to load the output to the Warehouse. Clicking on this activity to see more details:

…shows how long it took – 15 seconds – plus how many rows were loaded to the destination Warehouse and how much data.

I then created a second dataflow to see the effect of staging. It’s important to understand that copying the previous dataflow and enabling staging on its only query does not do what I wanted here: I had to create two queries, one with staging enabled and no destination set (called PP here) to stage all the raw data from the CSV file, and a second one (called Counties here) that references the first, does the group by, has staging disabled and has its destination set to the Warehouse I used in the previous dataflow.

Here’s the diagram view for these two queries:

Note the blue outline on the PP query which indicates that it’s staged and the grey outline on the Counties query that indicates that it is not staged.

Looking at the Refresh History for this dataflow showed that it took around 40 seconds to run on average:

Looking at the first level of detail for the last refresh showed the extra activity for staging the data:

Clicking on the details for this staging activity for the PP table showed that it took 17 seconds to load all the raw data:

The activity to write the data to the Warehouse took about the same as with the first dataflow:

In summary, the first dataflow clearly performs better than the second dataflow. In this case, therefore, it looks like the overhead of staging the data made the performance worse.

Don’t take this simple example as proof of a general rule: every dataflow will be different and there are a lot of performance optimisations planned for Dataflows Gen2 over the next few months, so you should test the impact of staging for yourself. I can imagine that for other data sources (a Lakehouse source is likely to perform very well, even for files) and other transformations staging will have a positive impact. On the other hand, if you’re struggling with Dataflows Gen2 performance, especially at the time of writing this post, turning off staging could lead to a performance improvement.

Log Analytics · Performance Tuning · Power BI · Refresh

Exploring Power BI Import Mode Throughput With Log Analytics, Part 3: Throughput Variations During A Refresh

In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.

Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportCurrent" or OperationName=="ProgressReportBegin" or OperationName == "ProgressReportEnd"
| where XmlaObjectPath!=""
| where OperationDetailName=="ReadData"
| project OperationName, XmlaObjectPath, EventText, TimeGenerated, ProgressCounter, OperationDetailName, XmlaRequestId, ObjectPathRequestId = strcat(XmlaObjectPath ,XmlaRequestId)
| partition by ObjectPathRequestId
(
    sort by TimeGenerated asc
    | scan declare( PrevEventTimeGenerated:datetime, TimeSincePrevEvent:timespan=time(0s), TimeSinceRefreshBegin:timespan=time(0s), NumberOfRowsRead:int=0) with 
    (
        step RefreshStarts: OperationName=="ProgressReportBegin";
        step RefreshInProgress: OperationName == "ProgressReportCurrent" =>
            PrevEventTimeGenerated = iff(ProgressCounter==10000, RefreshStarts.TimeGenerated, RefreshInProgress.TimeGenerated),
            TimeSincePrevEvent = iff(ProgressCounter==10000, TimeGenerated - RefreshStarts.TimeGenerated, TimeGenerated-RefreshInProgress.TimeGenerated),
            TimeSinceRefreshBegin = TimeGenerated - RefreshStarts.TimeGenerated,
            NumberOfRowsRead = iff(ProgressCounter==10000, 10000, ProgressCounter-RefreshInProgress.ProgressCounter);
        step RefreshEnds: OperationName == "ProgressReportEnd" => 
            PrevEventTimeGenerated = RefreshInProgress.TimeGenerated,
            TimeSincePrevEvent = TimeGenerated - RefreshInProgress.TimeGenerated,
            TimeSinceRefreshBegin = TimeGenerated - RefreshStarts.TimeGenerated,
            NumberOfRowsRead = ProgressCounter - RefreshInProgress.ProgressCounter;
    )
    | where OperationName == "ProgressReportCurrent" or OperationName == "ProgressReportEnd"
)
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0], OperationName, 
StartTime = format_datetime(PrevEventTimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), EndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ProgressCounter, SecondsSincePreviousEvent = TimeSincePrevEvent/1s, SecondsSinceRefreshBegin=TimeSinceRefreshBegin/1s,NumberOfRowsRead, 
NumberOfRowsReadPerSecond=NumberOfRowsRead/(TimeSincePrevEvent/1s), XmlaRequestId

Here’s a brief explanation of what it does:

  • It filters the Log Analytics data down to get events from the last day and just the ProgressReportCurrent events, as well as the ProgressReportBegin/End events which are fired before and after the ProgressReportCurrent events.
  • It then splits the data into groups of rows (‘partitions’ in KQL, but of course not the partitions that are being refreshed) by a combination of XmlaRequestId (ie the refresh operation) and XmlaObjectPath (ie the partition that is being refreshed)
  • For each group of rows it will then:
    • Find the ProgressReportBegin event and from this get the time when data started to be read from the source
    • Get all subsequent ProgressReportCurrent events and calculate the amount of time elapsed since the previous event (which might be the ProgressReportBegin event or a previous ProgressReportCurrent event) and the number of rows read
    • When the ProgressReportEnd event is encountered, calculate the amount of time elapsed since the previous ProgressReportCurrent event and the number of rows (which will be less than 10000) read since then
  • Filter out the ProgressReportBegin events because we don’t need them any more
  • Finally, it adds columns that split out the table name and partition name and calculate the number of rows read per second, by dividing the number of rows read for each event by the amount of time elapsed since the previous event

What can this query tell us about throughput?

First of all, something interesting but not necessarily useful. At least for the data source I’m using for my tests, when I plotted a column chart with the number of rows read on the x axis and the amount of time elapsed since the last event on the y axis (ie the amount of time it takes to read 10000 rows, for all but the last column) I noticed that every 200000 rows something happens to slow down the read:

I have no idea what this is, whether it’s a quirk of this particular source or connector, but it’s a great example of the kind of patterns that become obvious when you visualise data rather than look at a table of numbers.

Plotting time on the x axis of a line chart and the cumulative total of rows read on the y axis gives you something more useful. Here’s the chart for one of the refreshes mentioned in my last post where four partitions of the same table are refreshed in parallel:

In this case throughput is fine up until the end of the refresh at which point something happens to the February, March and April partitions but not the January partition to slow them down for about 30 seconds, after which throughput goes back to what it was before. Here’s the same chart zoomed in a bit:

Here’s the same problem shown in the first graph above, where the number of rows read is on the x axis, showing how for example with the April partition there’s a sudden spike where it takes 14 seconds to read 10000 rows rather than around 0.3 seconds:

What is this, and why isn’t the January partition affected? Maybe it was a network issue or caused by something happening in the source database? Looking at another refresh that also refreshes the same four partitions in parallel, it doesn’t seem like the same thing happens – although if you look closely at the middle of the refresh there might be a less pronounced flattening off:

Again, the point of all this is not the mysterious blips I’ve found in my data but the fact that if you take the same query and look at your refreshes, you may find something different, something more significant and something you can explain and do something about.

Log Analytics · Performance Tuning · Power BI · Refresh

Exploring Power BI Import Mode Throughput With Log Analytics, Part 2: High-Level Statistics

In the first post in this series I described the events in Log Analytics that can be used to understand throughput – the speed at which Power BI can read data from your data source during refresh. While the individual events are easy to understand when you look at a simple example, they don’t make it easy to analyse the data in the real world, so here’s a KQL query that takes all the data from all these events and gives you one row per partition per refresh:

//Headline stats for partition refresh with one row for each partition and refresh
//Get all the data needed for this query and buffer it in memory
let RowsForStats = 
materialize(
  PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" 
or (OperationDetailName == "TabularRefresh" and (EventText contains "partition"))  
);
//Get just the events for the initial SQL execution phase
let ExecuteSql = 
RowsForStats
| where OperationDetailName == "ExecuteSql"
| project XmlaRequestId, XmlaObjectPath, 
ExecuteSqlStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
ExecuteSqlEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ExecuteSqlDurationMs = DurationMs, ExecuteSqlCpuTimeMs = CpuTimeMs;
//Get just the events for the data read and calculate rows read per second
let ReadData =
RowsForStats
| where OperationDetailName == "ReadData"
| project XmlaRequestId, XmlaObjectPath, 
ReadDataStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
ReadDataEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ReadDataDurationMs = DurationMs, ReadDataCpuTime = CpuTimeMs, 
TotalRowsRead = ProgressCounter, RowsPerSecond = ProgressCounter /(toreal(DurationMs)/1000);
//Get the events for the overall partition refresh
let TabularRefresh =
RowsForStats
| where OperationDetailName == "TabularRefresh"
| parse EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, XmlaObjectPath, 
TabularRefreshStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
TabularRefreshEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
TabularRefreshDurationMs = DurationMs, TabularRefreshCpuTime = CpuTimeMs, 
MashupCPUTimeMs, MashupPeakMemoryKB;
//Do an inner join on the three tables so there is one row per partition per refresh
ExecuteSql
| join kind=inner ReadData on XmlaRequestId, XmlaObjectPath
| join kind=inner TabularRefresh on XmlaRequestId, XmlaObjectPath
| project-away XmlaRequestId1, XmlaRequestId2, XmlaObjectPath1, XmlaObjectPath2
| extend Table = tostring(split(XmlaObjectPath,".", 2)[0]), Partition = tostring(split(XmlaObjectPath,".", 3)[0])
| project-reorder XmlaRequestId, Table, Partition
| order by XmlaRequestId, ExecuteSqlStartTime desc

It’s a bit of a monster query but what it does is quite simple:

  • First it gets all the events relating to partition refresh in the past 1 day (which of course you can change) and materialises the results.
  • Then it filters this materialised result and gets three sets of tables:
    • All the ExecuteSql events, which tell you how long the data source took to start returning data and how much CPU time was used.
    • All the ReadData events, which tell you how long Power BI took to read all the rows from the source after the data started to be returned, how much CPU time was used, and how many rows were read. Dividing the number of rows read by the duration lets you calculate the number of rows read per second during this phase.
    • All the TabularRefresh events, which give you overall data on how long the partition refresh took, how much CPU time was used, plus information on Power Query peak memory usage and CPU usage.

What can this tell us about refresh throughput though? Let’s use it to answer some questions we might have about throughput.

What is the impact of parallelism on throughput? I created a dataset containing a single table based on the NYC taxi data Trip table, and in that table created four partitions containing data for January, February, March and April 2013, each of which contained 13-15 million rows. I won’t mention the type of data source I used because I think it distracts from what I want to talk about here, which is the methodology rather than the performance characteristics of a particular source.

I then ran two refreshes of these four partitions: one which refreshed them all in parallel and one which refreshed them sequentially, using custom TMSL refresh commands and the maxParallelism property as described here. I did a refresh of type dataOnly, rather than a full refresh, in the hope that it would reduce the number of things happening in the Vertipaq engine during refresh that might skew my results. Next, I used the query above as the source for a table in Power BI (for details on how to use Log Analytics as a source for Power BI see this post; I found it more convenient to import data rather than use DirectQuery mode though) to visualise the results.

Comparing the amount of time taken for the SQL query used to start to return data (the ExecuteSqlDurationMs column from the query above) for the four partitions for the two refreshes showed the following:

The times for the four partitions vary a lot for the sequential refresh but are very similar for the parallel refresh; the January partition, which was refreshed first, is slower in both cases. The behaviour I described here regarding the first partition refreshed in a batch could be relevant.

Moving on to the Read Data phase, looking at the number of rows read per second (the RowsPerSecond column from the query above) shows a similar pattern:

There’s a lot more variation in the sequential refresh. Also, as you would expect, the number of rows read per second is much higher when partitions are refreshed sequentially compared to when they are refreshed in parallel.

Looking at the third main metric, the overall amount of time taken to refresh each partition (the TabularRefreshDurationMs column from the query above) again shows no surprises:

Each individual partition refreshes a lot faster in the sequential refresh – almost twice as fast – compared to the parallel refresh. Since four partitions are being refreshed in parallel during the second refresh, though, any loss of throughput for an individual partition as a result of refreshing in parallel is more than compensated for by the parallelism, making the parallel refresh faster overall. This can be shown by plotting the TabularRefreshStartTime and TabularRefreshEndTime columns from the query above on a timeline chart (in this case the Craydec Timelines custom visual) for each refresh and each partition:

On the left of the timeline you can see the first refresh where the partitions are refreshed sequentially, and how the overall duration is just over 20 minutes; on the right you can see the second refresh where the partitions are refreshed in parallel, which takes just under 10 minutes. Remember also that this is just looking at the partition refresh times, not the overall time taken for the refresh operation for all partitions, and it’s only a refresh of type dataOnly rather than a full refresh.

So does this mean more parallelism is better? That’s not what I’ve been trying to say here: more parallelism is better for overall throughput in this test but if you keep on increasing the amount of parallelism you’re likely to reach a point where it makes throughput and performance worse. The message is that you need to test to see what the optimal level of parallelism – or any other factor you can control – is for achieving maximum throughput during refresh.

These tests only show throughput at the level of the ReadData event for a single partition, but as mentioned in my previous post there is even more detailed data available with the ProgressReportCurrent event. In my next post I’ll take a closer look at that data.

[Thanks to Akshai Mirchandani for providing some of the information in this post, and hat-tip to my colleague Phil Seamark who has already done some amazing work in this area]

Log Analytics · Performance Tuning · Power BI · Refresh

Exploring Power BI Import Mode Throughput With Log Analytics, Part 1: The Events

If you’re tuning refresh for a Power BI Import mode dataset one of the areas you’ll be most interested in is throughput, that is to say how quickly Power BI can read data from the data source. It can be affected by a number of different factors: how quickly the data source can return data; network latency; the efficiency of the connector you’re using; any transformations in Power Query; the number of columns in the data and their data types; the number of other objects in the same Power BI dataset being refreshed in parallel; and so on. How do you know if any of these factors are a problem for you? It’s a subject that has always interested me and now that Log Analytics for Power BI datasets is GA we have a powerful tool to analyse the data, so I thought I’d do some testing and write up my findings in a series of blog posts.

The first thing to understand is what events there are in Log Analytics that provide data on throughput. The events to look at are ProgressReportBegin, ProgressReportCurrent and ProgressReportEnd (found in the OperationName column), specifically those with OperationDetailName ExecuteSql, ReadData and TabularRefresh. Consider the following KQL query that looks at this data for a single refresh and a single partition:

let RefreshId = "e5edc0de-f223-4c78-8e2d-01f24b13ccdc";
let PartitionObjectPath = "28fc7514-d202-4969-922a-ec86f98a7ea2.Model.TIME.TIME-ffca8cb8-1570-4f62-8f04-993c1d1d17cb";
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(3d)
| where XmlaRequestId == RefreshId
| where XmlaObjectPath == PartitionObjectPath
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" or OperationDetailName == "TabularRefresh"
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0], 
TimeGenerated, OperationName, OperationDetailName, EventText, DurationMs, CpuTimeMs, ProgressCounter
| order by XmlaObjectPath, TimeGenerated asc;

Some notes on what this query does:

  • All the data comes from the PowerBIDatasetsWorkspace table in Log Analytics
  • I’ve put an arbitrary filter on the query to only look for data in the last three days, which I know contains the data for the specific refresh I’m interested in
  • An individual refresh operation can be identified by the value in the XmlaRequestId column and I’m filtering by that
  • An individual partition being refreshed can be identified by the value in the XmlaObjectPath column and I’m filtering by that too
  • The value in the XmlaObjectPath column can be parsed to obtain both the table name and partition name
  • The query filters the events down to those mentioned above

Here are some of the columns from the output of this query, showing data for the refresh of a single table called TIME with a single partition:

Some more notes on what we can see here:

  • The TimeGenerated column gives the time of the event. Although the time format shown here only shows seconds, it actually contains the time of the event down to the millisecond level – unlike in a Profiler trace, where time values are rounded to the nearest second and are therefore a lot less useful.
  • The first event returned by this query is a ProgressReportBegin event of type TabularRefresh which marks the beginning of the partition refresh.
  • As I blogged here, after that there is a pair of ProgressReportBegin/End events of type ExecuteSql. The value in the DurationMs column tells you how long it took for the data source to start to return data, which includes the time taken by the actual query generated against the data source and any Power Query transformations – in this case, 6016 milliseconds or about 6 seconds.
  • Next there is a ProgressReportBegin event which indicates the beginning of data being read from the source.
  • After that there are a series of ProgressReportCurrent events which mark the reading of chunks of 10000 rows from the source. The ProgressCounter column shows the cumulative number of rows read.
  • Next there is a ProgressReportEnd event that marks the end of the data read. The ProgressCounter shows the total number of rows read (which must be less than 10000 rows greater than the value in the ProgressCounter column for the previous ProgressReportCurrent event); the DurationMs column shows the total time taken to read the data. In this case 86430 rows of data were read in 1.4 seconds.
  • Finally there is a ProgressReportEnd event of type TabularRefresh, the pair of the first event returned by the query. It not only shows the total amount of time taken and the CPU time used to refresh the partition in the DurationMs and CpuTimeMs columns (which include other operations that have nothing to do with throughput, such as compressing data and building dictionaries); as I blogged here, it also includes data on the CPU time and peak memory used by Power Query during the refresh. In this case the total refresh time was 7.6 seconds, so not much more than the time taken to read the data.

Clearly there’s a lot of useful, interesting data here, but how can we analyse it or visualise it? Stay tuned for my next post…

Direct Lake · Fabric · Performance Tuning · Power BI

Performance Testing Power BI Direct Lake Mode Datasets In Fabric

If you’re excited about Direct Lake mode in Fabric you’re probably going to want to test it with some of your own data, and in particular look at DAX query performance. Before you do so, though, there are a few things to know about performance testing with Direct Lake datasets that are slightly different from what you might be used to with Import mode or DirectQuery datasets.

Dataset “hotness”

In my last post I talked about how, in Direct Lake datasets, Power BI can page individual column segments, dictionaries and join indexes into memory on demand when a DAX query is run and how those artefacts may get paged out later on. It therefore follows that there are four possible states or levels of “hotness” that a dataset can be in when a DAX query is run and that each of these states will have different performance characteristics:

  1. None of the column segments, dictionaries and join indexes needed to answer a query are held in memory, so they all need to be paged in before the query can run.
  2. Some of the column segments, dictionaries and join indexes needed to answer a query are not held in memory and need to be paged in, while some of them are already in memory.
  3. All of the column segments, dictionaries and join indexes needed by the query are already held in memory.
  4. All of the column segments, dictionaries and join indexes needed by the query are already held in memory and, as a result of previous query activity, Vertipaq engine caches useful for the query are also already populated.

State (1) is the “coldest” state and will give the worst possible query performance while state (4) is the “hottest” state and will give the best possible query performance.

When you’re testing the performance of a DAX query on a Direct Lake dataset you should test it on a dataset that is in state (1), state (3) and state (4) so you get a good idea of how much time is taken to page data into memory and how much of a performance improvement Vertipaq engine caching brings.

Ensuring everything is paged out

To test query performance in state (1) you need a way to ensure that all column segments, dictionaries and join indexes are paged out of memory. At the time of writing this post you can ensure this simply by refreshing the dataset. This will change sometime in the next few months though, because paging everything out of memory when you refresh is not ideal behaviour in the real world, so there will be another way to ensure everything is paged out. I’ll update this post when that change happens. You can ensure that all column segments and dictionaries have been paged out by running the two DMV queries mentioned in my previous post: DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS and DISCOVER_STORAGE_TABLE_COLUMNS.
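For example, here’s a sketch of the kind of DMV query you could run in DAX Studio to check this – I’m assuming the isresident and temperature columns here, which are relatively recent additions to this DMV and may not be present in every engine version; column segments that have been paged out should show isresident as false:

select 
table_id, column_id, segment_number, 
isresident, temperature 
from $SYSTEM.DISCOVER_STORAGE_TABLE_COLUMN_SEGMENTS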

Why you should use DAX Studio for performance testing

I also recommend using DAX Studio to run queries when performance testing, for a number of reasons. First of all it makes it easy to clear the Vertipaq engine cache before a query is run with the “Clear Cache” and “Clear on Run” (which automatically clears the cache before each query) buttons. This not only runs a Clear Cache command, to clear the Vertipaq cache, it also runs a (hidden) DAX query that does not query any data from the dataset but which does trigger the creation of all the measures on the dataset. In most cases this is very fast, but if you have thousands of measures it could take a few seconds (similar to what I show here for report-level measures). If you are not using DAX Studio you can achieve the same result by running a query like:

EVALUATE {1}

DAX Studio also lets you run the same query multiple times using its “Run benchmark” feature (although this only lets you test for states (3) and (4) at the time of writing) and its “Server Timings” feature is invaluable for understanding what’s going on inside the Vertipaq engine when a query runs.

Also make sure you are running the very latest version of DAX Studio (which is 3.0.8 at the time of writing) to make sure it works properly with Fabric.

Performance testing methodology

So, putting this all together, in order to run a single performance test on a Direct Lake dataset to capture performance for states (1), (3) and (4):

  1. Preparation
    • Create a custom Power BI dataset in Fabric (not everything here works with a default dataset at the time of writing)
    • Open your report in Power BI Desktop connected to your published Direct Lake dataset
    • Capture the DAX queries generated by the visuals you want to test using Performance Analyzer by clicking Copy Query (see here for how to do this)
    • Install the very latest version of DAX Studio and open it
    • Connect DAX Studio to your workspace’s XMLA Endpoint (see here for how to do this)
    • Paste your DAX query into DAX Studio
    • Turn on DAX Studio’s Server Timings option
    • Ensure the “Clear on Run” option in the ribbon is turned off
  2. To test performance for state (1):
    • Refresh your dataset to ensure everything is paged out of memory
    • Click the “Clear Cache” button on the ribbon in DAX Studio
    • Run the DAX query
    • Save the output of the Server Timings pane in DAX Studio by clicking the Export button
  3. To test performance for state (3), immediately after the previous steps:
    • Click the “Clear Cache” button on the ribbon in DAX Studio
    • Run the DAX query
    • Save the output of the Server Timings pane in DAX Studio
  4. To test the performance for state (4), immediately after the previous steps:
    • Run the DAX query again without clicking the “Clear Cache” button
    • Save the output of the Server Timings pane

Fallback to DirectQuery

Even with a Direct Lake dataset there is no guarantee that your query will be answered in Direct Lake mode: in as-yet not fully documented scenarios (but basically if your data volumes are too large for the capacity you’re using) Power BI will switch to using DirectQuery mode against the Lakehouse’s SQL Endpoint. One of the objectives of your performance testing should be to make sure that this happens as infrequently as possible because DirectQuery mode will perform noticeably worse than Direct Lake mode.

You may notice some DirectQuery events even when the query itself only uses Direct Lake; these are used to get metadata or security information from the Lakehouse and can be ignored. Here’s an example of this:

Load testing

Testing for a single user is important, but don’t forget about testing performance with multiple concurrent users. As I discuss here, realistic load tests are the only way to get a good idea of how your report will actually perform in production and you can load test a Direct Lake dataset in exactly the same way as an Import mode or DirectQuery dataset.

Direct Lake is still in preview!

The last point to make is that, like the rest of Fabric, Direct Lake is still in preview. This not only means that functionality is missing, it also means that performance is not yet as good as it will be. So, by all means test Direct Lake and tell us how fast (or not) it is, but be aware that your test results will be out of date very quickly as the engine evolves.

[Thanks to Krystian Sakowski for the information in this post]

DAX · Performance Tuning · Power BI · Power BI Desktop

The “Visual Has Exceeded The Available Resources” Error In Power BI Desktop

One of my most popular blog posts of the last few years is this one on the “Visual has exceeded available resources” error in the Power BI Service:

https://blog.crossjoin.co.uk/2020/01/20/visual-has-exceeded-the-available-resources-error-power-bi/

This error only used to appear in the Power BI Service, but the good news is – and trust me, this is good news – it may now appear in Power BI Desktop too following the May 2023 release.

First, a quick recap of what causes this error. The Power BI Service has finite resources so we at Microsoft don’t allow you to run queries there that take hours to complete or consume vast amounts of memory; we impose limits on query duration and memory usage, and you’ll see this error in the Power BI Service when you hit these limits. The problem has been that, up until recently, these limits were not imposed in Power BI Desktop so it was easy to develop inefficient reports and datasets that worked ok (if a little slowly) on the developer’s PC but then caused errors after they were published. What has changed is that these limits are now imposed in Power BI Desktop too and they are also configurable there.

The announcement of this change is here:

https://powerbi.microsoft.com/en-us/blog/power-bi-may-2023-feature-summary/#post-23312-_Toc134425717

How do you know if you are running into these limits? You’ll see an error in your visual in Power BI Desktop that looks like this:

The error message is:

Visual has exceeded the available resources

If you click on the “See details” link you’ll see the following dialog:

Resources Exceeded
This visual has exceeded the available resources. Try filtering to decrease the amount of data displayed.

What should you do if you encounter this error? The most important thing is to understand why it’s happening. There are two possible causes:

  1. Your query is taking too long to run. For the Power BI Service, the default query timeout is 225 seconds although it is possible for an admin to reduce this; unless you’ve set a custom limit or you’re not using a Power BI dataset as a source, it’s likely that 225 seconds is the longest that a query will run for in Power BI Desktop.
  2. Your query is using too much memory. This is probably because you are doing an inefficient DAX calculation on a very large table (filtering an entire table rather than a single column is a classic anti-pattern, for example – see the sketch after this list).
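Here’s a hedged illustration of that anti-pattern, using hypothetical Sales table and SalesAmount column names – the first version materialises the whole table as a filter, the second filters a single column and typically uses far less memory:

-- Likely to use a lot of memory: FILTER iterates the whole Sales table
Sales Over 100 Slow =
CALCULATE (
    SUM ( Sales[SalesAmount] ),
    FILTER ( Sales, Sales[SalesAmount] > 100 )
)

-- Usually much cheaper: the filter is applied to a single column
Sales Over 100 Fast =
CALCULATE (
    SUM ( Sales[SalesAmount] ),
    Sales[SalesAmount] > 100
)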

As a result you need to do some tuning. “But Chris!”, you say, “my query has to run for longer than 225 seconds! It’s too difficult to tune!” To which I say – tough, you don’t have a choice. Slow, inefficient queries result in unhappy users so if you don’t fix these issues you have an even bigger problem. This video of Marco’s is a good place to start if you want to learn about performance tuning.

In order to do that tuning though (or if you just want to get stuff done before you do any tuning, or if you think the wrong limits are being imposed) you’ll need to turn off the limits so you can capture the DAX query for the offending visual using Performance Analyzer. To do this, go to File/Options and settings/Options to open the Options dialog, go to the Report settings pane and scroll to the bottom to see the Query limit simulations section:

You can either use the “Custom limits” option, as shown in the screenshot above, to set your own limits (enter 0 in one of these boxes for no limits to be imposed) or use the “No query limits” option to remove the limits entirely. You should only use these options temporarily, remember, otherwise you run the risk of getting this error in the Power BI Service later on!

It’s worth mentioning that the limits imposed in Power BI Desktop are only imposed on queries generated by Power BI Desktop itself, which means that they won’t affect external tools like DAX Studio that can also be used to query a dataset in Power BI Desktop. You can see how the limits are imposed by running a Profiler trace on Power BI Desktop, finding the Query Begin event for each query and looking for the Timeout and DbpropMsmdRequestMemoryLimit properties in the Property List shown under the query text:

Also, these settings are saved on a per-file basis, so if you create a new .pbix file it will have the default settings and not the settings you made in any other .pbix file.

DirectQuery · Performance Tuning · Power BI

Power BI DirectQuery Best Practices Video

Here’s a recording of a session I did for the Manchester (UK) Power BI user group recently on best practices for DirectQuery mode in Power BI:

I’ve done it for a few other groups over the last six months but this is the latest and best version, I think.

I’ve worked with several customers using DirectQuery mode since I joined the CAT team and learned a lot along the way. Some of this knowledge has been written up by me (eg this post on Snowflake DirectQuery mode best practices), some of it by the people who work with the data sources Power BI runs on top of (see posts by Dany Hoter on the Azure Data Explorer blog, for example; there’s also going to be a lot of new material on DirectQuery for Databricks coming soon, with this post as a start). There’s a lot of detailed information in the docs too, for example here, here and here.

But remember folks: the most important piece of advice around DirectQuery is to not use it unless you’re really, really, really sure you have no other option. It’s possible to make it work well but it takes a lot more tuning and specialist skill than Import mode!