Direct Lake · Fabric · Refresh

What Does It Mean To Refresh A Direct Lake Power BI Dataset In Fabric?

If you’ve heard about the new Direct Lake mode for Power BI datasets in Fabric you’ll know that it gives you the query performance of Import mode (well, almost) without the need to actually import any data. Direct Lake datasets can be refreshed though – in fact, they refresh automatically by default – and if you look at the dataset’s Refresh History you’ll see there’s a Direct Lake section which sometimes shows errors:

Also, if you look at a custom dataset’s Settings page (although not yet for a default dataset) you’ll see some properties to control refresh too:

So what does it actually mean to refresh a Direct Lake dataset if it doesn’t involve loading data into the dataset?

The section on Direct Lake refresh in the Fabric docs has the following information:

Invoking a refresh for a Direct Lake dataset is a low cost operation where the dataset analyzes the metadata of the latest version of the Delta Lake table and is updated to reference the latest files in the OneLake.

Let’s see what this means using a simple example. I built a Dataflow Gen2 that loads a single row of data into a table in a Fabric Lakehouse with two columns: one called Sales that always contains the value 1 and one called LoadDate that contains the date and time the dataflow ran:

I ran the dataflow once to load a row of data into a table called MyTable in a Lakehouse:

I then built a custom dataset (because I want to change those refresh options mentioned above) consisting of just this table, and finally a report showing the contents of the table in the dataset:

I then connected SQL Server Profiler to the custom dataset via the XMLA Endpoint and started a trace to capture the Command Begin/End and Progress Report Begin/End events, and refreshed the dataflow (and only the dataflow) to load another row of data into the Lakehouse. Soon after the dataflow refresh finished, the Profiler trace showed a dataset refresh started automatically:

Refreshing the report showed the second row that had just been loaded:

This shows that, with the default settings, a Direct Lake dataset is automatically refreshed when data is loaded into a Lakehouse.
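
As an aside, if you have Log Analytics enabled on the workspace (and Log Analytics works with Direct Lake datasets, as mentioned elsewhere on this page) you could look for these automatic refreshes there rather than in a Profiler trace. Here’s a minimal sketch of the kind of KQL query that might work – it assumes that Direct Lake refreshes show up as CommandBegin/CommandEnd events in the PowerBIDatasetsWorkspace table in the same way that Import mode refreshes do, and “MyDirectLakeDataset” is a made-up dataset name:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where ArtifactName == "MyDirectLakeDataset" //made-up dataset name - use your own
| where OperationName == "CommandBegin" or OperationName == "CommandEnd"
| project TimeGenerated, OperationName, OperationDetailName, EventText, DurationMs
| order by TimeGenerated asc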

I then went to the Settings pane for the dataset and turned off the “Keep your Direct Lake data up to date” property:

I then ran the dataflow again and this time the Profiler trace showed that no automatic refresh took place; the new row was not shown in the report either. Manually refreshing the dataset from the workspace did result in the new row appearing in the report:

Next, I used a Notebook to delete all the rows from the table:

At this point the report still showed the three rows displayed in the previous screenshot. Finally, I refreshed the dataset one more time and all the data disappeared from the report:

It’s important to stress that the refresh is needed to show the latest data in the table not because data is being loaded into the dataset, but because, as the docs say, refresh tells the dataset to look at the latest version of the data in the table – which leads on to the whole topic of time travel in Delta tables in Fabric. Dennes Torres has a nice blog post on this subject here which is a great place to start.

Why would you ever want to refresh a Direct Lake dataset manually? Again, the docs have the answer:

You may want to disable [refresh] if, for example, you need to allow completion of data preparation jobs before exposing any new data to consumers of the dataset. 

So, let’s say you need to load some new data into your table and also delete or update some data that’s already there, and this work has to be done as several separate jobs. It’s very likely that you don’t want your Power BI reports to show any of the new data until all of these jobs have completed, so you’ll want to disable automatic refresh and run a manual dataset refresh as the last step of your ETL.

Log Analytics · Performance Tuning · Power BI · Refresh

Exploring Power BI Import Mode Throughput With Log Analytics, Part 3: Throughput Variations During A Refresh

In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it’s actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10000 rows read during partition refresh.

Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportCurrent" or OperationName=="ProgressReportBegin" or OperationName == "ProgressReportEnd"
| where XmlaObjectPath!=""
| where OperationDetailName=="ReadData"
| project OperationName, XmlaObjectPath, EventText, TimeGenerated, ProgressCounter, OperationDetailName, XmlaRequestId, ObjectPathRequestId = strcat(XmlaObjectPath ,XmlaRequestId)
| partition by ObjectPathRequestId
(
    sort by TimeGenerated asc
    | scan declare( PrevEventTimeGenerated:datetime, TimeSincePrevEvent:timespan=time(0s), TimeSinceRefreshBegin:timespan=time(0s), NumberOfRowsRead:int=0) with 
    (
        step RefreshStarts: OperationName=="ProgressReportBegin";
        step RefreshInProgress: OperationName == "ProgressReportCurrent" =>
            PrevEventTimeGenerated = iff(ProgressCounter==10000, RefreshStarts.TimeGenerated, RefreshInProgress.TimeGenerated),
            TimeSincePrevEvent = iff(ProgressCounter==10000, TimeGenerated - RefreshStarts.TimeGenerated, TimeGenerated-RefreshInProgress.TimeGenerated),
            TimeSinceRefreshBegin = TimeGenerated - RefreshStarts.TimeGenerated,
            NumberOfRowsRead = iff(ProgressCounter==10000, 10000, ProgressCounter-RefreshInProgress.ProgressCounter);
        step RefreshEnds: OperationName == "ProgressReportEnd" => 
            PrevEventTimeGenerated = RefreshInProgress.TimeGenerated,
            TimeSincePrevEvent = TimeGenerated - RefreshInProgress.TimeGenerated,
            TimeSinceRefreshBegin = TimeGenerated - RefreshStarts.TimeGenerated,
            NumberOfRowsRead = ProgressCounter - RefreshInProgress.ProgressCounter;
    )
    | where OperationName == "ProgressReportCurrent" or OperationName == "ProgressReportEnd"
)
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0], OperationName, 
StartTime = format_datetime(PrevEventTimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), EndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ProgressCounter, SecondsSincePreviousEvent = TimeSincePrevEvent/1s, SecondsSinceRefreshBegin=TimeSinceRefreshBegin/1s,NumberOfRowsRead, 
NumberOfRowsReadPerSecond=NumberOfRowsRead/(TimeSincePrevEvent/1s), XmlaRequestId

Here’s a brief explanation of what it does:

  • It filters the Log Analytics data down to get events from the last day and just the ProgressReportCurrent events, as well as the ProgressReportBegin/End events which are fired before and after the ProgressReportCurrent events.
  • It then splits the data into groups of rows (‘partitions’ in KQL, but of course not the partitions that are being refreshed) by a combination of XmlaRequestId (ie the refresh operation) and XmlaObjectPath (ie the partition that is being refreshed)
  • For each group of rows it will then:
    • Find the ProgressReportBegin event and from this get the time when data started to be read from the source
    • Get all subsequent ProgressReportCurrent events and calculate the amount of time elapsed since the previous event (which might be the ProgressReportBegin event or a previous ProgressReportCurrent event) and the number of rows read
    • When the ProgressReportEnd event is encountered, calculate the amount of time elapsed since the previous ProgressReportCurrent event and the number of rows (which will be less than 10000) read since then
  • Filter out the ProgressReportBegin events because we don’t need them any more
  • Finally, add columns that split out the table name and partition name, and calculate the number of rows read per second for each row by dividing the number of rows read for each event by the amount of time elapsed since the previous event

What can this query tell us about throughput?

First of all, something interesting but not necessarily useful. At least for the data source I’m using for my tests, when I plotted a column chart with the number of rows read on the x axis and the amount of time elapsed since the previous event on the y axis (ie the time taken to read each chunk of 10000 rows, apart from the final chunk which contains fewer rows), I noticed that every 200000 rows something happens to slow down the read:

I have no idea what this is or whether it’s a quirk of this particular source or connector, but it’s a great example of the kind of pattern that becomes obvious when you visualise data rather than looking at a table of numbers.
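
If you want to reproduce this kind of chart without leaving Log Analytics, the render operator can do it. Here’s a rough sketch, assuming you’ve saved the query above as a Log Analytics function called ReadDataThroughput (a name I’ve made up) and that <RefreshId> and <PartitionObjectPath> are the XmlaRequestId and XmlaObjectPath values for the partition refresh you’re interested in:

//A sketch, not taken from the original post
ReadDataThroughput
| where XmlaRequestId == "<RefreshId>" and XmlaObjectPath == "<PartitionObjectPath>"
| where OperationName == "ProgressReportCurrent"
//x axis = cumulative number of rows read, y axis = seconds taken to read each chunk of 10000 rows
| project ProgressCounter, SecondsSincePreviousEvent
| render columnchart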

Plotting time on the x axis of a line chart and the cumulative total of rows read on the y axis gives you something more useful. Here’s the chart for one of the refreshes mentioned in my last post where four partitions of the same table are refreshed in parallel:

In this case throughput is fine up until the end of the refresh, at which point something happens to the February, March and April partitions – but not the January partition – that slows them down for about 30 seconds, after which throughput goes back to what it was before. Here’s the same chart zoomed in a bit:

Here’s the same problem displayed in the style of the first graph above, with the number of rows read on the x axis, showing how, for example with the April partition, there’s a sudden spike where it takes 14 seconds to read 10000 rows rather than around 0.3 seconds:

What is this, and why isn’t the January partition affected? Maybe it was a network issue or caused by something happening in the source database? Looking at another refresh that also refreshes the same four partitions in parallel, it doesn’t seem like the same thing happens – although if you look closely at the middle of the refresh there might be a less pronounced flattening off:

Again, the point of all this is not the mysterious blips I’ve found in my data but the fact that if you take the same query and look at your refreshes, you may find something different, something more significant and something you can explain and do something about.
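
One last practical note: rather than eyeballing charts, you could also hunt for slow chunks programmatically by filtering the query output. Here’s another sketch, again assuming the query above has been saved as a function called ReadDataThroughput; the threshold is an arbitrary number you’d need to tune for your own data source:

ReadDataThroughput
| where OperationName == "ProgressReportCurrent"
| where NumberOfRowsReadPerSecond < 5000 //arbitrary threshold - tune for your source
| project XmlaRequestId, Table, Partition, StartTime, EndTime, NumberOfRowsRead, SecondsSincePreviousEvent, NumberOfRowsReadPerSecond
| order by SecondsSincePreviousEvent desc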

Log Analytics · Performance Tuning · Power BI · Refresh

Exploring Power BI Import Mode Throughput With Log Analytics, Part 2: High-Level Statistics

In the first post in this series I described the events in Log Analytics that can be used to understand throughput – the speed at which Power BI can read data from your data source when importing it – during refresh. While the individual events are easy to understand when you look at a simple example, they don’t make it easy to analyse the data in the real world, so here’s a KQL query that takes all the data from all these events and gives you one row per partition per refresh:

//Headline stats for partition refresh with one row for each partition and refresh
//Get all the data needed for this query and buffer it in memory
let RowsForStats = 
materialize(
  PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" 
or (OperationDetailName == "TabularRefresh" and (EventText contains "partition"))  
);
//Get just the events for the initial SQL execution phase
let ExecuteSql = 
RowsForStats
| where OperationDetailName == "ExecuteSql"
| project XmlaRequestId, XmlaObjectPath, 
ExecuteSqlStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
ExecuteSqlEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ExecuteSqlDurationMs = DurationMs, ExecuteSqlCpuTimeMs = CpuTimeMs;
//Get just the events for the data read and calculate rows read per second
let ReadData =
RowsForStats
| where OperationDetailName == "ReadData"
| project XmlaRequestId, XmlaObjectPath, 
ReadDataStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
ReadDataEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
ReadDataDurationMs = DurationMs, ReadDataCpuTime = CpuTimeMs, 
TotalRowsRead = ProgressCounter, RowsPerSecond = ProgressCounter /(toreal(DurationMs)/1000);
//Get the events for the overall partition refresh
let TabularRefresh =
RowsForStats
| where OperationDetailName == "TabularRefresh"
| parse EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, XmlaObjectPath, 
TabularRefreshStartTime = format_datetime(TimeGenerated - (DurationMs * 1ms),'yyyy-MM-dd HH:mm:ss.fff' ), 
TabularRefreshEndTime = format_datetime(TimeGenerated,'yyyy-MM-dd HH:mm:ss.fff' ), 
TabularRefreshDurationMs = DurationMs, TabularRefreshCpuTime = CpuTimeMs, 
MashupCPUTimeMs, MashupPeakMemoryKB;
//Do an inner join on the three tables so there is one row per partition per refresh
ExecuteSql
| join kind=inner ReadData on XmlaRequestId, XmlaObjectPath
| join kind=inner TabularRefresh on XmlaRequestId, XmlaObjectPath
| project-away XmlaRequestId1, XmlaRequestId2, XmlaObjectPath1, XmlaObjectPath2
| extend Table = tostring(split(XmlaObjectPath,".", 2)[0]), Partition = tostring(split(XmlaObjectPath,".", 3)[0])
| project-reorder XmlaRequestId, Table, Partition
| order by XmlaRequestId, ExecuteSqlStartTime desc

It’s a bit of a monster query but what it does is quite simple:

  • First it gets all the events relating to partition refresh in the past 1 day (which of course you can change) and materialises the results.
  • Then it filters this materialised result and gets three tables of events:
    • All the ExecuteSql events, which tell you how long the data source took to start returning data and how much CPU time was used.
    • All the ReadData events, which tell you how long Power BI took to read all the rows from the source after the data started to be returned, how much CPU time was used, and how many rows were read. Dividing the number of rows read by the duration lets you calculate the number of rows read per second during this phase.
    • All the TabularRefresh events, which give you overall data on how long the partition refresh took, how much CPU time was used, plus information on Power Query peak memory usage and CPU usage.

What can this tell us about refresh throughput though? Let’s use it to answer some questions we might have about throughput.
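
Before digging into those questions, it can also be useful to roll this output up to one row per refresh operation so that refreshes can be compared with each other. Here’s a quick sketch, assuming the query above has been saved as a Log Analytics function called PartitionRefreshStats (a made-up name); note that the duration calculated here only covers the span of the partition refreshes, not necessarily the whole refresh command:

PartitionRefreshStats
| summarize Partitions = dcount(XmlaObjectPath),
            TotalRowsRead = sum(TotalRowsRead),
            AvgRowsPerSecond = avg(RowsPerSecond),
            RefreshStart = min(todatetime(TabularRefreshStartTime)),
            RefreshEnd = max(todatetime(TabularRefreshEndTime))
            by XmlaRequestId
| extend RefreshDurationSeconds = (RefreshEnd - RefreshStart) / 1s
| order by RefreshStart desc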

What is the impact of parallelism on throughput? I created a dataset on top of the NYC taxi data Trip table with a single table, and in that table created four partitions containing data for January, February, March and April 2013, each of which contained 13-15 million rows. I won’t mention the type of data source I used because I think it distracts from what I want to talk about here, which is the methodology rather than the performance characteristics of a particular source.

I then ran two refreshes of these four partitions: one which refreshed them all in parallel and one which refreshed them sequentially, using custom TMSL refresh commands and the maxParallelism property as described here. I did a refresh of type dataOnly, rather than a full refresh, in the hope that it would reduce the number of things happening in the Vertipaq engine during refresh that might skew my results. Next, I used the query above as the source for a table in Power BI (for details on how to use Log Analytics as a source for Power BI see this post; I found it more convenient to import data rather than use DirectQuery mode though) to visualise the results.

Comparing the amount of time taken for the SQL query used to start to return data (the ExecuteSqlDurationMs column from the query above) for the four partitions for the two refreshes showed the following:

The times for the four partitions vary a lot for the sequential refresh but are very similar for the parallel refresh; the January partition, which was refreshed first, is slower in both cases. The behaviour I described here regarding the first partition refreshed in a batch could be relevant.

Moving on to the Read Data phase, looking at the number of rows read per second (the RowsPerSecond column from the query above) shows a similar pattern:

There’s a lot more variation in the sequential refresh. Also, as you would expect, the number of rows read per second is much higher when partitions are refreshed sequentially compared to when they are refreshed in parallel.

Looking at the third main metric, the overall amount of time taken to refresh each partition (the TabularRefreshDurationMs column from the query above) again shows no surprises:

Each individual partition refreshes a lot faster in the sequential refresh – almost twice as fast – compared to the parallel refresh. Since four partitions are being refreshed in parallel during the second refresh, though, this means that any loss of throughput for an individual partition as a result of refreshing in parallel is more than compensated for by the parallelism, making the parallel refresh faster overall. This can be shown by plotting the TabularRefreshStartTime and TabularRefreshEndTime columns from the query above on a timeline chart (in this case the Craydec Timelines custom visual) for each refresh and each partition:

On the left of the timeline you can see the first refresh where the partitions are refreshed sequentially, and how the overall duration is just over 20 minutes; on the right you can see the second refresh where the partitions are refreshed in parallel, which takes just under 10 minutes. Remember also that this is just looking at the partition refresh times, not the overall time taken for the refresh operation for all partitions, and it’s only a refresh of type dataOnly rather than a full refresh.

So does this mean more parallelism is better? That’s not what I’ve been trying to say here: more parallelism is better for overall throughput in this test but if you keep on increasing the amount of parallelism you’re likely to reach a point where it makes throughput and performance worse. The message is that you need to test to see what the optimal level of parallelism – or any other factor you can control – is for achieving maximum throughput during refresh.

These tests only show throughput at the level of the ReadData event for a single partition, but as mentioned in my previous post there is even more detailed data available with the ProgressReportCurrent event. In my next post I’ll take a closer look at that data.

[Thanks to Akshai Mirchandani for providing some of the information in this post, and hat-tip to my colleague Phil Seamark who has already done some amazing work in this area]

Log Analytics · Performance Tuning · Power BI · Refresh

Exploring Power BI Import Mode Throughput With Log Analytics, Part 1: The Events

If you’re tuning refresh for a Power BI Import mode dataset one of the areas you’ll be most interested in is throughput, that is to say how quickly Power BI can read data from the data source. It can be affected by a number of different factors: how quickly the data source can return data; network latency; the efficiency of the connector you’re using; any transformations in Power Query; the number of columns in the data and their data types; the number of other objects in the same Power BI dataset being refreshed in parallel; and so on. How do you know if any or all of these factors are a problem for you? It’s a subject that has always interested me and now that Log Analytics for Power BI datasets is GA we have a powerful tool to analyse the data, so I thought I’d do some testing and write up my findings in a series of blog posts.

The first thing to understand is what events there are in Log Analytics that provide data on throughput. The events to look at are ProgressReportBegin, ProgressReportCurrent and ProgressReportEnd (found in the OperationName column), specifically those with OperationDetailName ExecuteSql, ReadData and TabularRefresh. Consider the following KQL query that looks at this data for a single refresh and for a single partition:

let RefreshId = "e5edc0de-f223-4c78-8e2d-01f24b13ccdc";
let PartitionObjectPath = "28fc7514-d202-4969-922a-ec86f98a7ea2.Model.TIME.TIME-ffca8cb8-1570-4f62-8f04-993c1d1d17cb";
PowerBIDatasetsWorkspace
| where TimeGenerated > ago(3d)
| where XmlaRequestId == RefreshId
| where XmlaObjectPath == PartitionObjectPath
| where OperationDetailName == "ExecuteSql" or OperationDetailName == "ReadData" or OperationDetailName == "TabularRefresh"
| project XmlaObjectPath, Table = split(XmlaObjectPath,".", 2)[0], Partition = split(XmlaObjectPath,".", 3)[0], 
TimeGenerated, OperationName, OperationDetailName, EventText, DurationMs, CpuTimeMs, ProgressCounter
| order by XmlaObjectPath, TimeGenerated asc;

Some notes on what this query does:

  • All the data comes from the PowerBIDatasetsWorkspace table in Log Analytics
  • I’ve put an arbitrary filter on the query to only look for data in the last three days, which I know contains the data for the specific refresh I’m interested in
  • An individual refresh operation can be identified by the value in the XmlaRequestId column and I’m filtering by that
  • An individual partition being refreshed can be identified by the value in the XmlaObjectPath column and I’m filtering by that too (see the sketch after this list for one way to find the XmlaRequestId and XmlaObjectPath values to filter on)
  • The value in the XmlaObjectPath column can be parsed to obtain both the table name and partition name
  • The query filters the events down to those mentioned above
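
Here’s one way you might find the RefreshId and PartitionObjectPath values to plug into the query in the first place: a sketch (not from the original post) that lists recent partition refreshes, assuming – as in the queries earlier on this page – that ProgressReportEnd events of type TabularRefresh whose EventText mentions a partition correspond to individual partition refreshes:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(3d)
| where OperationName == "ProgressReportEnd"
| where OperationDetailName == "TabularRefresh" and EventText contains "partition"
| project TimeGenerated, ArtifactName, XmlaRequestId, XmlaObjectPath, DurationMs
| order by TimeGenerated desc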

Here are some of the columns from the output of the main query above, showing data for the refresh of a single table called TIME with a single partition:

Some more notes on what we can see here:

  • The TimeGenerated column gives the time of the event. Although the time format shown here only shows seconds, it actually contains the time of the event down to the millisecond level – unlike in a Profiler trace, where time values are rounded to the nearest second and are therefore a lot less useful.
  • The first event returned by this query is a ProgressReportBegin event of type TabularRefresh which marks the beginning of the partition refresh.
  • As I blogged here, after that there are a pair of ProgressReportBegin/End events of type ExecuteSql. The value in the DurationMs column tells you how long it took for the data source to start to return data (which includes the time taken by the actual query generated against the data source and any Power Query transformations) – in this case, 6016 milliseconds or about 6 seconds.
  • Next there is a ProgressReportBegin event which indicates the beginning of data being read from the source.
  • After that there are a series of ProgressReportCurrent events which mark the reading of chunks of 10000 rows from the source. The ProgressCounter column shows the cumulative number of rows read.
  • Next there is a ProgressReportEnd event that marks the end of the data read. The ProgressCounter shows the total number of rows read (which must be less than 10000 rows greater than the value in the ProgressCounter column for the previous ProgressReportCurrent event); the DurationMs column shows the total time taken to read the data. In this case 86430 rows of data were read in 1.4 seconds.
  • Finally there is a ProgressReportEnd event of type TabularRefresh, the partner of the first event returned by the query. Not only does its DurationMs column show the total amount of time taken and the CPU time used to refresh the partition (which includes other operations that have nothing to do with throughput, such as compressing data and building dictionaries), but, as I blogged here, it also includes data on the CPU and peak memory used by Power Query during the refresh. In this case the total refresh time was 7.6 seconds, so not much more than the time taken to read the data.

Clearly there’s a lot of useful, interesting data here, but how can we analyse it or visualise it? Stay tuned for my next post…

Fabric · Log Analytics · Power BI · Refresh

Monitoring Power BI Dataset Refresh Memory And CPU Usage With Log Analytics

Maybe the fourth- or fifth-most exciting Power BI-related announcement last month (admittedly it was an exciting month) was that Log Analytics for Power BI datasets is now GA and you can now link multiple Power BI workspaces to a single Log Analytics workspace. This, for me, means that enabling Log Analytics has gone from being useful to essential for anyone interested in monitoring Analysis Services engine activity in an enterprise Power BI/Fabric deployment. It works with Direct Lake datasets too!

Anyway, following on from my recent posts on measuring memory and CPU usage during dataset refresh, both for the refresh as a whole and for individual partitions, here are some KQL queries that allow you to extract that information from Log Analytics.

Here’s the KQL query to get the memory and CPU usage for a refresh (see this post for details):

PowerBIDatasetsWorkspace
| where OperationName == "CommandEnd"
| parse-where EventText with * '[PeakMemory: ' PeakMemoryKB:long ' KB, MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| extend DatasetName=ArtifactName
| project TimeGenerated, CpuTimeMs, DurationMs, MashupCPUTimeMs, MashupPeakMemoryKB, PeakMemoryKB, XmlaRequestId, DatasetName, PowerBIWorkspaceName

And here’s the KQL query to get the memory and CPU usage for individual partitions/Power Query queries (see this post for details):

PowerBIDatasetsWorkspace
| where OperationName == "ProgressReportEnd"
| parse-where EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: '  MashupPeakMemoryKB:long ' KB]'
| extend Table=split(XmlaObjectPath,'.').[2], Partition=split(XmlaObjectPath,'.').[3]
| extend DatasetName=ArtifactName
| project TimeGenerated, Table, Partition, CpuTimeMs, DurationMs, MashupCPUTimeMs, MashupPeakMemoryKB, XmlaRequestId, DatasetName, PowerBIWorkspaceName

You could use these queries in a Power BI DirectQuery dataset to report on all your refreshes, similar to what I did in this series of posts last year. The XmlaRequestId column contains the unique identifier for the refresh operation, so you could build a one-to-many relationship between the table with the refresh-level data and the table with the partition-level data. Maybe I’ll get round to building that report sometime…
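
Alternatively, if you just want to see the refresh-level and partition-level figures side by side in a single result set rather than modelling them as two related tables, a sketch like this – which simply joins condensed versions of the two queries above on XmlaRequestId, with some columns renamed by me to keep them distinct – would work:

let RefreshLevel =
PowerBIDatasetsWorkspace
| where OperationName == "CommandEnd"
| parse-where EventText with * '[PeakMemory: ' PeakMemoryKB:long ' KB, MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, DatasetName = ArtifactName, RefreshDurationMs = DurationMs, RefreshPeakMemoryKB = PeakMemoryKB;
let PartitionLevel =
PowerBIDatasetsWorkspace
| where OperationName == "ProgressReportEnd"
| parse-where EventText with * '[MashupCPUTime: ' PartitionMashupCPUTimeMs:long ' ms, MashupPeakMemory: ' PartitionMashupPeakMemoryKB:long ' KB]'
| project XmlaRequestId, Table = tostring(split(XmlaObjectPath, ".")[2]), Partition = tostring(split(XmlaObjectPath, ".")[3]), PartitionMashupCPUTimeMs, PartitionMashupPeakMemoryKB;
RefreshLevel
| join kind=inner PartitionLevel on XmlaRequestId
| project-away XmlaRequestId1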

[Thanks to Jast Lu, who wrote the original version of these queries]

Power BI · Refresh

Power BI Automatic Refresh Retries

One of the most confusing things about troubleshooting Power BI refresh problems is the way Power BI will sometimes try running a refresh again after it has failed. It means that refreshes seem to take a lot longer than you would expect and also that more queries are run against your data source than you would expect. I don’t have all the answers but I thought it would be useful to highlight a few scenarios where it does happen.

A while ago I wrote a series of posts on dataset refresh timeouts which started out with this post on the fact that no refresh in the Power BI portal can last more than 2 hours (in Shared/Pro capacity) or 5 hours (in Premium), as documented here. While writing that post I built a dataset that takes more than 5 hours to refresh; back then, and testing it again now, I found that it seemed to take a lot longer than 5 hours to fail.

What happened? I kicked off the refresh at 07:33 UTC one morning. With Log Analytics enabled on the workspace I could run the following KQL query to see what happened five hours later:

PowerBIDatasetsWorkspace
| where TimeGenerated>datetime(2023-05-05 12:32:00) and TimeGenerated<datetime(2023-05-05 12:35:00)
| where OperationName in ('CommandBegin', 'CommandEnd', 'Error', 'ProgressReportBegin', 'ProgressReportEnd', 'ProgressReportError')
| where OperationDetailName<>'ExtAuth'
| project TimeGenerated, OperationName, OperationDetailName, DurationMs, CpuTimeMs, EventText, Status, StatusCode
| order by TimeGenerated asc

Sure enough, around 12:34 UTC I could see errors happening for the existing refresh, then (on line 22 in the screenshot below) a CommandEnd event corresponding to the refresh that failed and which had run for five hours. After that (on line 25 in the screenshot below) there was a CommandBegin event for a new refresh on the same dataset:

Interestingly, even though I could see from Log Analytics that the refresh had restarted and I could also see the refresh “spinny” spinning in the workspace, the Refresh History dialog was already showing a timeout:

[I’m told that we have plans to improve this and provide more details on what’s happening in the UI in the future]

The same thing happened again: the next refresh failed five hours later at 17:34 UTC that day, then again at 23:36 UTC, and again at 03:46 UTC the next day; there was then a delay of about an hour before the refresh started again at 04:35 UTC, and the final failure came one hour after that at 05:37 UTC. Lots of failures and lots of retries.
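
If you want to check whether this kind of thing is happening to your own datasets, one option is to count how many refresh commands the engine receives for each dataset over time: a cluster of commands where you only expected one refresh suggests failures followed by retries. Here’s a rough sketch – it assumes, as in the Log Analytics data shown above, that each attempt shows up as its own CommandEnd event and that the EventText for a refresh command contains the TMSL “refresh” keyword:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(2d)
| where OperationName == "CommandEnd"
| where EventText contains "refresh" //assumption: refresh commands contain the TMSL refresh keyword
| summarize RefreshCommands = count() by ArtifactName, bin(TimeGenerated, 1h)
| where RefreshCommands > 1
| order by TimeGenerated asc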

I happened to capture a similar refresh retry in a Profiler trace recently. I scheduled a refresh for 23:00 UTC on a busy Premium capacity and saw the following:

This trace shows:

  • No activity before 23:05, which I think shows how busy the capacity was at the time
  • The first attempted refresh kicked off at 23:05:10 (the first highlighted pair of Command Begin/End events in the screenshot) and failed immediately.
  • The error message associated with this failure was “You’ve exceeded the capacity limit for dataset refreshes. Try again when fewer datasets are being processed”, an error documented here. This was caused by reaching the limit for the number of concurrent datasets that can be refreshed on a single capacity (as documented in the Model Refresh Parallelism column of the table here).
  • The dataset refresh kicked off again at 23:06:10 and finished successfully after 42 seconds (the second highlighted pair of Command Begin/End events in the screenshot above).

So two scenarios where Power BI retried a refresh after an initial failure. As I said, though, not all refresh failures result in retries. For example, when I tried refreshing the dataset in this post which fails with a timeout on the SQL Server side, I could see that Power BI only ran the refresh command once:

I don’t know when the Power BI Service does and doesn’t retry a refresh after a failure but I’m trying to find out; I’ll blog again if I do get more information.

While you don’t have any control over retry behaviour if you schedule a refresh in the Power BI Service or kick off a manual refresh from the portal, you do get more control if you refresh using the Enhanced Refresh API and set the retryCount parameter. Indeed the docs for the Enhanced Refresh API have some extra detail about how retryCount and the five hour timeout interact here, mentioning that even if you keep retrying a refresh it will always fail after 24 hours and that the total time for a refresh includes the time taken for all failures before a success. This, for me, is one more reason to use the Enhanced Refresh API for refreshing enterprise-level datasets.

Log Analytics · Performance Tuning · Power BI · Power Query · Refresh

Measuring Memory And CPU Usage In Power BI During Dataset Refresh

This post is a follow-up to my recent post on identifying CPU and memory-intensive Power Query queries in Power BI. In that post I pointed out that Profiler and Log Analytics now give you information on the CPU and memory used by an individual Power Query query when importing data into Power BI. What I didn’t notice when I wrote that post is that there is also now information available in Profiler and Log Analytics that tells you about peak memory and CPU usage across all Power Query queries for a single refresh in the Power BI Service, as well as memory usage for the refresh as a whole.

Using the same dataset from my previous post, I ran a Profiler trace on the workspace and captured the Command Begin and Command End events while I refreshed the dataset. Here’s what Profiler shows for the Command End event:

In the upper pane, the Duration column tells you how long the refresh took in milliseconds and the CPUTime column tells you how much CPU was used by both the Analysis Services engine and the Power Query engine during the refresh. This is not new, and I wrote about the CPUTime column last year here.

In the lower pane, where the TMSL for the refresh operation is shown (this is the text from the TextData column), there are three new values displayed underneath the TMSL:

  • PeakMemory shows the maximum amount of memory used during the refresh operation. This includes memory used by the Analysis Services engine and the Power Query engine.
  • MashupCPUTime shows the total amount of CPU used by the Power Query engine for all Power Query queries used to import data into the dataset. This value will always be less than the value shown in the CPUTime column in the upper pane.
  • MashupPeakMemory shows the maximum amount of memory used during the refresh operation by just the Power Query engine across all query evaluations. It’s important to note that this value may not be entirely accurate since the memory usage values, and therefore the peaks, are captured asynchronously so there could be higher peaks that are not detected.

This new information should be extremely useful for troubleshooting refresh failures that are caused by excessive memory consumption.

[Thanks to Akshai Mirchandani for this information]

Log Analytics · M · Performance Tuning · Power BI · Power Query · Refresh

Identifying CPU And Memory-Intensive Power Query Queries During Refresh In The Power BI Service

Last year I wrote a post about a change in the Power BI Service that meant the CPU Time associated with Power Query transformations was added to the total shown for a dataset refresh operation in Profiler traces and Log Analytics:

https://blog.crossjoin.co.uk/2022/07/03/measuring-power-query-cpu-usage-during-power-bi-dataset-refresh/

This was useful, but it didn’t tell you directly how much CPU Time was used by Power Query and it didn’t tell you which tables or partitions in a refresh were using the most CPU. It also didn’t tell you anything about Power Query memory usage. The good news is that recently there has been another change that solves these problems.

Let’s say you have a Power BI dataset that consists of a single table whose source is the following Power Query query:

let
  Source = #table(type table [MyNumber = number], List.Transform({1 .. 1000000}, each {_})), 
  #"Added Custom" = Table.AddColumn(
    Source, 
    "ARandomNumber", 
    each Number.RandomBetween(0, 10000), 
    type number
  ), 
  #"Sorted Rows" = Table.Sort(#"Added Custom", {{"ARandomNumber", Order.Ascending}})
in
  #"Sorted Rows"

This query creates a table with a million rows, adds a column with random numbers in and then sorts on that column – which is, as you’d expect, a very CPU and memory-hungry operation.

If you refresh this dataset in the Power BI Service and run a Profiler trace on it, looking at the Command Begin/End and Progress Report Begin/End events, this is what you’ll see:

The final Command End event shows the total duration of the refresh as well as the amount of CPU used by both the Analysis Services engine and Power Query – in this case 24094ms.

If you look at the Progress Report End event associated with the finish of the refresh for the only partition of the only table in the dataset (highlighted in the screenshot above), there’s some extra information:

It shows the amount of CPU Time and the maximum amount of memory used by the Power Query engine while refreshing this partition. In this case the Power Query engine used 19468ms of CPU and reached a peak of 581848KB of memory. I can tell this is going to be really useful for troubleshooting refresh performance issues and out-of-memory errors.
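
If you’re using Log Analytics rather than Profiler you can query the same information in bulk. Here’s a sketch – essentially the partition-level query from the post earlier on this page on monitoring refresh memory and CPU usage, reworked to return the ten partition refreshes with the highest Power Query peak memory usage over the last day:

PowerBIDatasetsWorkspace
| where TimeGenerated > ago(1d)
| where OperationName == "ProgressReportEnd"
| parse-where EventText with * '[MashupCPUTime: ' MashupCPUTimeMs:long ' ms, MashupPeakMemory: ' MashupPeakMemoryKB:long ' KB]'
| extend Table = tostring(split(XmlaObjectPath, ".")[2]), Partition = tostring(split(XmlaObjectPath, ".")[3])
| top 10 by MashupPeakMemoryKB desc
| project TimeGenerated, ArtifactName, Table, Partition, MashupCPUTimeMs, MashupPeakMemoryKB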

[Thanks to Akshai Mirchandani, Xiaodong Zhang, Ting-Wei Chang and Jast Lu for this information]

M · OData · Power BI · Power Query · Refresh

Using OData.Feed And The Query Option To Avoid The Dynamic Data Sources Error In Power BI

In my last post I showed how, in many cases, you can avoid the “dynamic data sources” error with OData data sources by taking advantage of query folding. That’s not always possible though and in this post I’ll show you how you can use the Query option of the OData.Feed function to do so instead.

At first glance the Query option of OData.Feed looks very much like the Query option on Web.Contents which, of course, can also be used to avoid the dynamic data sources error (see here and here) and that’s true up to a point: you can use it to add query parameters to your OData URL. However the documentation is not particularly detailed and there is one thing that will confuse you when you try to use it: you can’t use it with OData system query options like $filter. For example, let’s say you wanted to query the People entity in the TripPin sample OData endpoint to get only the people whose first name was Scott. You can do this as follows:

let
  Source = OData.Feed(
    "https://services.odata.org/TripPinRESTierService/People?$filter=FirstName eq 'Scott'",
    null,
    [Implementation = "2.0"]
  )
in
  Source

Since $filter is a url query parameter, you might then try the following code:

let
  Source = OData.Feed(
    "https://services.odata.org/TripPinRESTierService/People",
    null,
    [
      Implementation = "2.0",
      Query = [
        #"$filter"
          = "FirstName eq 'Scott'"
      ]
    ]
  )
in
  Source

Unfortunately, this returns the following error:

Expression.Error: OData.Feed custom query options cannot start with ‘$’.

This is Power Query’s way of telling you you can’t use OData system query options with the OData.Feed Query option, since they always start with a $ sign. This is a deliberate design decision on the part of the Power Query team; I won’t go into the reasons why it is this way, it’s complicated!

However, you can use the Query option with OData custom query options like so:

let
  Source = OData.Feed(
    "https://services.odata.org/TripPinRESTierService/People",
    null,
    [
      Implementation = "2.0",
      Query = [#"debug-mode" = "true"]
    ]
  )
in
  Source

More importantly, you can use the Query option with OData parameter aliases. The original URL we were looking at:

https://services.odata.org/TripPinRESTierService/People?$filter=FirstName eq 'Scott'

…can be rewritten to use parameter aliases like so:

https://services.odata.org/TripPinRESTierService/People?$filter=FirstName eq @f&@f='Scott'

…and the rewritten version can be called using the Query option like so:

let
  Source = OData.Feed(
    "https://services.odata.org/TripPinRESTierService/People?$filter=FirstName eq @f", 
    null, 
    [
      Implementation = "2.0", 
      Query = [#"@f" = "'Scott'"]
    ]
  )
in
  Source

You can use parameter aliases to call OData unbound functions too. For example, you can call TripPin’s GetNearestAirport function like so:

let
  Source = OData.Feed(
    "https://services.odata.org/TripPinRESTierService/GetNearestAirport(lat = @latitude, lon = @longitude)", 
    null, 
    [
      Implementation = "2.0", 
      Query = [
        #"@latitude"  = "100", 
        #"@longitude" = "0"
      ]
    ]
  ), 
  #"Converted to Table" = Record.ToTable(
    Source
  )
in
  #"Converted to Table"

Finally, here’s an example of a query that starts with a table of latitudes and longitudes and calls the GetNearestAirport function for each row:

let
  Source = #table(
    type table [
      LocationLatitude = text,
      LocationLongitude = text
    ],
    {
      {"10", "10"},
      {"33", "-118"},
      {"100", "-100"}
    }
  ),
  #"Added Custom" = Table.AddColumn(
    Source,
    "FunctionCall",
    each OData.Feed(
      "https://services.odata.org/TripPinRESTierService/GetNearestAirport(lat = @latitude, lon = @longitude)",
      null,
      [
        Implementation = "2.0",
        Query = [
          #"@latitude" = [
            LocationLatitude
          ],
          #"@longitude" = [
            LocationLongitude
          ]
        ]
      ]
    )
  ),
  #"Expanded FunctionCall"
    = Table.ExpandRecordColumn(
    #"Added Custom",
    "FunctionCall",
    {"Name", "IcaoCode", "IataCode"},
    {
      "FunctionCall.Name",
      "FunctionCall.IcaoCode",
      "FunctionCall.IataCode"
    }
  )
in
  #"Expanded FunctionCall"

When you first publish a dataset containing this query to the Power BI Service you’ll see an internal server error message coming from the TripPin OData service. You can make the dataset refresh successfully, though, and avoid both this and the dynamic data sources error: you need to check the Skip Test Connection box in the credentials dialog that you can open from the Settings pane, and also set the data privacy level appropriately on the data source that is sending data to the OData function:

Power BI · Refresh

How Much Data Can You Load Into Power BI?

Power BI can handle large data volumes, but just how much data can you load into Power BI? Can Power BI handle big data? How big is “big” anyway? These are questions you may have when you’re starting out on a new Power BI project and the answers can be hard to find. Indeed the answer in most cases is “it depends”, which isn’t very helpful. In this post I will try to explain the various limits on Power BI dataset size and how you can know if you’re likely to hit them.

Power BI storage modes

First of all, what does it even mean to “load data into” Power BI? There are two fundamental ways that Power BI can work with data: Import mode and DirectQuery mode. Different tables in your Power BI dataset can have different storage modes; there’s also an option called Dual mode where a table can switch between Import mode and DirectQuery mode depending on the circumstances.

In DirectQuery mode no data is stored in your Power BI dataset: when your report renders, queries are sent to the dataset and the dataset in turn sends queries to the data source to get the data it needs on demand. This means the only limits on the amount of data you can work with are the limits imposed by your data source.

In Import mode Power BI stores a copy of your data inside the dataset in its own internal database engine, known as either the Vertipaq engine or the Analysis Services engine. Import mode is the default storage mode and for good reason – it will almost always give you the best performance for your reports and it allows you to use all Power BI functionality. The “how much data?” question is only relevant to Import mode because when you use it Power BI imposes various limits on the amount of data that can be stored inside a dataset.

How big is your dataset?

In order to answer the question of “how much data can you load into Power BI?” you need to know how to measure the size of your dataset. There are various different ways of doing this but the best way is to install a free, community-developed tool called DAX Studio: its Model Metrics feature shows you the total size of your dataset on the Summary tab:

This figure is good enough for most purposes. It’s not quite that simple though, and this video by Marco Russo and Patrick Leblanc has a great explanation of all the different ways of measuring dataset size and what the figures really represent.

Can you predict how large your dataset is if you know how much data you have in your data source?

You have to import all your data into your dataset to find out how large your dataset is. But can you work out how large your dataset will be before you do this? The answer is no. You can be pretty sure that the size of your dataset will be smaller than the size of your data in your data source, but it could be only a bit smaller or it could be a lot smaller, maybe even just 10-20% of the original size. This is because Power BI compresses your data very efficiently when you import it. What’s more, how you model your data can have a big impact on how well the compression works and making a few changes can result in a much smaller dataset as, for example, this post by Nikola Ilic shows.

How much data can you load into Power BI Desktop?

The only practical limit on the amount of data you can load into Power BI Desktop is the amount of memory you have on your PC; you’ll need at least 16GB of RAM, ideally 32GB, to get the best experience. However Power BI Desktop is just a development tool – you’ll need to publish to the Power BI Service for other people to view your report and that’s where the limits take effect. What’s more there’s a 10GB limit on the size of a dataset you can publish to the Power BI Service from Power BI Desktop, although you can have datasets much larger than that in Power BI Premium.

To be honest you should never be working with anything like 10GB of data in Power BI Desktop anyway: the file size will be huge, saving will be slow and you’ll spend a long time waiting for data to import while you’re developing. What you should do is work with a small subset of your data in Desktop and only load the full volume after you’ve published. You can do this in a few different ways, for example using incremental refresh or by using M parameters to limit the amount of data you load in Desktop and then changing the parameter after publishing, as I showed here.

How much data can you load in the Power BI Service if you’re using Shared capacity (aka “Power BI Pro”)?

If you are not using Power BI Premium or Premium Per User, when you publish to the Power BI Service you are using Shared capacity (often called “Power BI Pro” by Power BI users because you only need a Power BI Pro licence to use it). The maximum size of a dataset in Shared capacity is 1GB; if you hit that limit you’ll get the “Unable to save the changes” error I blogged about here. There is one exception though: as mentioned here, if you use the rather obscure option in the Power BI Service to upload an Excel workbook with data in the Excel Data Model/Power Pivot, then the resulting dataset is limited to 250MB.

How much data can you load in the Power BI Service if you’re using Power BI Premium or Premium Per User (PPU)?

The default maximum dataset size in a Power BI Premium capacity or PPU is 10GB, but if you turn on the Large Dataset storage format you can have datasets larger than that and the maximum size depends on the amount of memory available in your Premium capacity. The “Max memory per dataset” column in the table here shows the maximum amount of memory available in each Premium or Power BI Embedded capacity SKU to an individual dataset; the maximum amount of memory available per dataset in PPU is 100GB. However the maximum amount of memory available to a dataset is not the maximum size of a dataset: apart from the memory used to store the data for a dataset, more memory will be needed when the dataset is queried or refreshed. If you do a full refresh of a dataset it may require almost double the amount of memory needed to store the dataset itself; incremental refresh may require less memory. You can use the Premium Capacity Metrics App to see how your dataset’s memory usage changes over time.

Other Import mode limits

There are a few other limitations of Import mode that should be mentioned. As documented here you can only have 1,999,999,997 distinct values in a single column and there is a limit of 16,000 on the combined total number of tables and columns in those tables – but if you have that many tables and columns you have definitely made a mistake with how you model your data. Also, some DAX functions such as Median() only work on tables with less than 2 billion rows in them, as I blogged here.

Power BI can load all my data, but will my reports be fast enough to use?

Loading all your data into Power BI is one thing but what really matters is whether the reports you build are fast enough to use. In my experience performance problems relating to data volume are not as common as performance problems caused by inefficient DAX in measures or poor modelling decisions, although they do occur – for example distinct counts on columns with a very large number of unique values can be slow. If you’re following best practices you’re unlikely to encounter them unless you’re working with the kind of data volumes that require Premium. DAX Studio’s Server Timings feature can help you understand why your queries are slow and whether data volumes are an issue.

This is a lot to take in – what does it all mean?

Let me finish up by making a few broad generalisations about how much data you can load into Power BI Import mode. Assuming you have followed all the best practices around modelling data you should be able to work with tables with up to a few million rows, probably tens of millions of rows, in Shared capacity and tables with up to a few billion rows in Premium. If your data source is Excel then Power BI can definitely handle the amount of data you have; if you are working with a relational database like SQL Server then it’s still very likely Import mode will work; even if you’re working with hundreds of gigabytes of data or more in a source like Azure Synapse, Snowflake or BigQuery then Import mode may still be ok. You will need to know Power BI very well to get good performance in Import mode with the largest data volumes but it is certainly possible and I know of several customers that are doing it.