When Power BI dataflows were officially released a few weeks ago there was a new setting for Premium capacities mentioned briefly in the announcement blog post: Container Size.
The blog post only says the following:
We’re introducing a new dataflow workload on premium capacity settings, enabling you to optimize dataflow workload performance for processing more complex, compute-heavy dataflows. This setting is available in the Capacity Admin portal, Dataflow workload settings.
…which does not tell you much at all. Pedro Fernandes contacted me to see if I knew more and because I didn’t, I started investigating. This resulted in me learning lots of new information around how the Power Query engine in Power BI Desktop and Excel uses memory, how things are different in the Power BI service, and how all of this can have an impact on query refresh performance.
If you’ve read this blog post, and this related thread on the Power Query forum, you’ll know that when a Power Query query is evaluated the work is done by a Microsoft.Mashup.Container process, visible in tools such as Task Manager and Resource Monitor. A single refresh operation in Excel or Power BI Desktop might result in multiple evaluations of multiple queries for different reasons, so it’s not uncommon to see multiple Microsoft.Mashup.Container processes.
In another thread from the Power Query forum, Curt Hagenlocher of the dev team explains there is a 256MB limit on the amount of physical RAM that each Microsoft.Mashup.Container process can use, although there is is no limit on the amount of virtual memory that can be used. The thread is about how using Table.Buffer can be bad for refresh performance but the details are more widely applicable. Here are the highlights:
Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.
If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.
Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.
Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.
The 256MB limit is also mentioned briefly on this thread.
Here’s a screenshot from Resource Monitor showing this: the Microsoft.Mashup.Container process in this case is evaluating a query that reads data from a large (900,000ish rows) csv file and does a Pivot, and while the Working Set value has peaked at around the 256MB limit you can see the Commit value is much higher, and so paging must be taking place.
Currently there is no way to change this 256MB in Power BI Desktop or Excel although someone has already posted a suggestion on the Ideas site to allow us to change it. How much of an impact does this actually have on refresh performance though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.
The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use. This makes a lot of sense in the context of Power BI Pro and shared capacity because Microsoft could not allow one user to run lots of complex, expensive Power Query queries that might affect other users inside or outside the same tenant. With Power BI Premium, which gives you your own dedicated capacity in the Power BI service, there is no chance that anything you do will affect other tenants and so Microsoft gives you more control over how resources are used. As a result, the new Container Size setting for a dataflow in a Premium capacity lets you configure the amount of memory that can be used for a single entity refresh within a dataflow – and refreshing a single entity in a dataflow is, as far as I understand it, the equivalent of what a Microsoft.Mashup.Container process does on the desktop.
I did some (not very scientific) testing and it looks like increasing the Container Size setting can have a noticeable impact on the performance of memory-intensive queries. Using the technique I blogged about here I measured the execution time of the query mentioned above that does a Pivot on data from a large, local csv file in Power BI Desktop: it took 98 seconds on my laptop. I then used a dataflow to load the source data from the csv file into an entity in a dataflow on an A4 capacity, without making any changes. This took 132 seconds; you can get the time taken for a dataflow refresh by clicking on the Refresh History option, as Matthew Roche shows here. I then created a computed entity that used this new entity as its source and which did the same Pivot operation as the original query on the desktop. The following table shows the time taken to refresh this computed entity in the A4 capacity with different Container Size settings:
|Container Size (MB)||Refresh Time (seconds)|
700MB is the default setting for Container Size and the smallest value that you can use; I guess the maximum value you can set will depend on the size of the capacity you’re using. There are two conclusions that I think you can draw from these results:
- Even with the default setting for Container Size, it was faster for the computed entity to read the data from the source entity (which, remember, stores its data in Azure Data Lake Gen2 storage) and do the Pivot in the Power BI service than it was for Power Query in Power BI Desktop to read the data from the csv file and do the same Pivot operation on my laptop.
- Increasing the Container Size setting reduced refresh time quite significantly.
So, as the blog post I referenced at the very beginning of this post states, if you are doing memory-intensive operations such as group bys, sorts, pivots, unpivots and joins against non-foldable data sources in a dataflow, and if that dataflow is on a Premium capacity, then increasing the Container Size property is probably a good idea because it may reduce refresh times. If you can reproduce this on your own Premium capacities please let me know by leaving a comment – I would be very interested to hear about your experiences.
[Thanks to Curt Hagenlocher and Anton Fritz for providing information for this blog post]