Power Query Memory Usage, Dataflow Container Size And Refresh Performance

When Power BI dataflows were officially released a few weeks ago there was a new setting for Premium capacities mentioned briefly in the announcement blog post: Container Size.

The blog post only says the following:

We’re introducing a new dataflow workload on premium capacity settings, enabling you to optimize dataflow workload performance for processing more complex, compute-heavy dataflows. This setting is available in the Capacity Admin portal, Dataflow workload settings.

…which does not tell you much at all. Pedro Fernandes contacted me to see if I knew more and, because I didn’t, I started investigating. As a result I learned a lot of new information about how the Power Query engine in Power BI Desktop and Excel uses memory, how things are different in the Power BI service, and how all of this can have an impact on query refresh performance.

If you’ve read this blog post, and this related thread on the Power Query forum, you’ll know that when a Power Query query is evaluated the work is done by a Microsoft.Mashup.Container process, visible in tools such as Task Manager and Resource Monitor. A single refresh operation in Excel or Power BI Desktop might result in multiple evaluations of multiple queries for different reasons, so it’s not uncommon to see multiple Microsoft.Mashup.Container processes.

image

In another thread from the Power Query forum, Curt Hagenlocher of the dev team explains that there is a 256MB limit on the amount of physical RAM that each Microsoft.Mashup.Container process can use, although there is no limit on the amount of virtual memory that can be used. The thread is about how using Table.Buffer can be bad for refresh performance, but the details are more widely applicable. Here are the highlights:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.

 

The 256MB limit is also mentioned briefly on this thread.
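
To make the trade-off concrete, here’s a minimal sketch of the kind of query where Table.Buffer can help because the table is enumerated more than once. The csv path and the exact transformations are made up purely for illustration; only the general pattern matters:

let
    //Hypothetical example: an expensive-to-read source, here a large csv file
    Source = Csv.Document(
        File.Contents("c:\data\large.csv"),
        [Delimiter = ",", Encoding = 65001]
        ),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    //Buffer the rows in memory so that the two enumerations below reuse them
    //instead of reading the csv twice. If the buffered table needs more than
    //256MB of RAM, though, the container starts paging and buffering can end
    //up slower than not buffering at all.
    Buffered = Table.Buffer(Promoted),
    Output = #table(
        {"Rows", "DistinctRows"},
        {{Table.RowCount(Buffered), Table.RowCount(Table.Distinct(Buffered))}}
        )
in
    Output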

Here’s a screenshot from Resource Monitor showing this: the Microsoft.Mashup.Container process in this case is evaluating a query that reads data from a large (roughly 900,000 rows) csv file and does a Pivot. While the Working Set value has peaked at around the 256MB limit, the Commit value is much higher, so paging must be taking place.

image

Currently there is no way to change this 256MB limit in Power BI Desktop or Excel, although someone has already posted a suggestion on the Ideas site to allow us to change it. How much of an impact does this actually have on refresh performance, though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.

The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use. This makes a lot of sense in the context of Power BI Pro and shared capacity because Microsoft could not allow one user to run lots of complex, expensive Power Query queries that might affect other users inside or outside the same tenant. With Power BI Premium, which gives you your own dedicated capacity in the Power BI service, there is no chance that anything you do will affect other tenants and so Microsoft gives you more control over how resources are used. As a result, the new Container Size setting for a dataflow in a Premium capacity lets you configure the amount of memory that can be used for a single entity refresh within a dataflow – and refreshing a single entity in a dataflow is, as far as I understand it, the equivalent of what a Microsoft.Mashup.Container process does on the desktop.

I did some (not very scientific) testing and it looks like increasing the Container Size setting can have a noticeable impact on the performance of memory-intensive queries. Using the technique I blogged about here I measured the execution time of the query mentioned above that does a Pivot on data from a large, local csv file in Power BI Desktop: it took 98 seconds on my laptop. I then used a dataflow to load the source data from the csv file into an entity in a dataflow on an A4 capacity, without making any changes. This took 132 seconds; you can get the time taken for a dataflow refresh by clicking on the Refresh History option, as Matthew Roche shows here. I then created a computed entity that used this new entity as its source and which did the same Pivot operation as the original query on the desktop. The following table shows the time taken to refresh this computed entity in the A4 capacity with different Container Size settings:

Container Size (MB)    Refresh Time (seconds)
700                    61
2000                   57
5000                   52

700MB is the default setting for Container Size and the smallest value that you can use; I guess the maximum value you can set will depend on the size of the capacity you’re using. There are two conclusions that I think you can draw from these results:

  • Even with the default setting for Container Size, it was faster for the computed entity to read the data from the source entity (which, remember, stores its data in Azure Data Lake Gen2 storage) and do the Pivot in the Power BI service than it was for Power Query in Power BI Desktop to read the data from the csv file and do the same Pivot operation on my laptop.
  • Increasing the Container Size setting reduced refresh time quite significantly.

So, as the blog post I referenced at the very beginning of this post states, if you are doing memory-intensive operations such as group bys, sorts, pivots, unpivots and joins against non-foldable data sources in a dataflow, and if that dataflow is on a Premium capacity, then increasing the Container Size property is probably a good idea because it may reduce refresh times. If you can reproduce this on your own Premium capacities please let me know by leaving a comment – I would be very interested to hear about your experiences.

[Thanks to Curt Hagenlocher and Anton Fritz for providing information for this blog post]

BI Survey 19

The BI Survey is the largest annual survey of BI users in the world, and every year I get a free copy of the results (which are always very interesting) in return for publicising it here on my blog. If you take part, you’ll also get a summary of the results and be entered into a draw for some Amazon vouchers. Here’s the link to take the survey:

https://www.efs-survey.com/uc/BARC_GmbH/f42f/?a=101

Power BI did very well last year and I’m sure it will do even better this year!

Table.Buffer() Does Not Buffer Nested Values

Here’s yet another entry in the list of useful things I learned from Ehren von Lehe on the Power Query MSDN forum: Table.Buffer() does not buffer nested table, record or list values inside the cells of a table. From this thread:

Buffering is shallow. It will force the evaluation of any scalar cell values, but will leave non-scalar values (records, lists, tables, etc.) as-is.

It’s actually really difficult to come up with a simple demo query to prove this, though (the Power Query engine is too clever about not evaluating things it doesn’t need for the final output of a query), but the principle is fairly easy to understand. Whenever you have an expression that returns a table like this:

image

…if you use Table.Buffer() on this table it will only buffer the scalar values (in this case the text values in the Name and Signature columns). The nested table values, as in the highlighted cell, will not be buffered in memory, and if you try to access their contents it may result in another call back to the underlying data source.
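
As a rough sketch of where this matters in practice (the server, database, table and column names below are all invented, so this is an illustration rather than a reproduction of the query in the screenshot), a nested join against a SQL Server source produces exactly this kind of column of nested table values:

let
    //Hypothetical example: server, database and table names are made up
    Source = Sql.Database("myserver", "mydb"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data],
    OrderLines = Source{[Schema = "dbo", Item = "OrderLines"]}[Data],
    //The nested join adds a column whose cells contain nested tables
    Joined = Table.NestedJoin(
        Orders, {"OrderID"},
        OrderLines, {"OrderID"},
        "Lines", JoinKind.LeftOuter
        ),
    //Buffering evaluates the scalar columns of Joined, but the nested tables
    //in the Lines column are not buffered; expanding them later may mean
    //another call back to the database
    Buffered = Table.Buffer(Joined)
in
    Buffered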

Comparing The Performance Of Reading Data From Files With File.Contents And Web.Contents In Power Query And Power BI

In my last post I mentioned the Power Query engine’s persistent cache, which in some scenarios caches the data read from a data source when a query is refreshed. Another important nugget of information that Ehren von Lehe of the Power Query dev team mentioned in a post on the Power Query MSDN forum recently is the fact that if you use File.Contents to get data from a file then the persistent cache is not used, but if you use Web.Contents to get data from the same file then the persistent cache is used. I guess the thinking here is that there is no point creating an on-disk cache containing the contents of a file that is already on disk.

Using Process Monitor (see here and here for more information on how to do this) to view how much data is read from disk when a query is run, it is possible to see this in action. Consider how much data is read from a 150MB csv file when a slow query is refreshed. This particular slow query results in five reads of the csv file – it’s more or less the same scenario from my second Process Monitor blog post here, with the query itself described in a lot of detail here. Here’s a graph of the data captured by Process Monitor, showing time in seconds on the X axis and the amount of data read in MB on the Y axis:

image

Since this query uses File.Contents to get the data from the csv file, the persistent cache is not used; as you can see, the graph shows clearly that the full contents of the csv file are read five times.

The same query altered to use Web.Contents shows just two full reads:

image

I have no idea why the file is read twice rather than once, but it’s definitely different behaviour to the version that uses File.Contents.

As far as I can see it is possible to replace File.Contents with Web.Contents in every case. So, if you have the following expression:

File.Contents("c:\users\myuser\Desktop\file.txt")

You can just replace it with:

Web.Contents("c:\users\myuser\Desktop\file.txt")

Which one is faster, though? Just because a query reads data from disk more often does not necessarily mean that it will be slower. In the above scenario, with the csv file stored on my local hard drive, the Web.Contents version of the query refreshes in 18 seconds while the File.Contents version refreshes in 14 seconds. Replacing the csv file with an Excel file that contains the same data (remember that Excel files are a lot slower than csv files to read data from, as I showed here) results in the File.Contents version of the query running in 205 seconds and the Web.Contents version running in 297 seconds. So it looks like, in most cases, File.Contents is the right choice when reading data from a file (as you would hope).

However, when using the same csv file stored on a network file share, the Web.Contents version takes 23 seconds while the File.Contents version takes 25 seconds. So maybe if you are dealing with files that are stored remotely over a slow connection it might be worth replacing File.Contents with Web.Contents to see if you get any performance benefits. There may be other situations where Web.Contents is the faster choice too. If you test this and see a difference, let me know by leaving a comment!

UPDATE: please also read Curt Hagenlocher’s comment below – Web.Contents may be changed in the future so that it only works with http/https URLs.

Power BI, Caching, Parallelism And Power Query Refresh Performance

Some time ago a customer of mine (thank you, Robert Lochner) showed me a very interesting scenario where a set of Power Query queries in Power BI Desktop refreshed a lot faster with the “Enable Parallel Loading Of Tables” option turned off. This seemed a bit strange, but I have recently been reading lots of posts by Ehren von Lehe of the Power Query dev team on caching and query evaluation on the Power Query MSDN forum, and two posts in particular, here and here, offered an explanation of what was going on. In this post I will walk through a very simple scenario that illustrates the information in those posts and sheds some light on the internals of the Power Query engine.

Let’s say you have a very simple web service built in Flow that, when it receives a GET request, waits 5 seconds and then returns the text “Hello”:

image

Here’s the M code for a Power Query query called SourceQuery that calls this web service and returns a table with one row and one column containing the text “Hello”:

let
    Source = Web.Contents("https://entermyurlhere.com"),
    ToLines = Lines.FromBinary(Source, null, null, 65001),
    ToTable = Table.FromColumns({ToLines})
in
    ToTable

image

Let’s also say you have four other identical queries called R1, R2, R3 and R4 that reference this query and do nothing but return the same table. Here’s the M code for these queries:

let
    Source = SourceQuery
in
    Source

Now, here’s the big question! If the output of SourceQuery is not loaded into Power BI but the output of R1, R2, R3 and R4 is loaded, as follows:

image

How many times would you expect the web service to be called when you refresh your dataset? One or four?

With the setting of “Enable Parallel Loading Of Tables” turned on, which is the default, the “Enable Data Preview To Download In The Background” setting turned off and Data Privacy checks turned off, Fiddler (see here for how to use it) shows the web service is called four times. SQL Server Profiler (see here for how to use it with Power BI) shows that the four queries are run in parallel and each query takes just over 5 seconds.

This seems wrong, because your natural inclination is to think that the Power Query engine evaluates SourceQuery first, and that after that has happened the data passes from there to R1, R2, R3 and R4. In fact what happens is the reverse: R1, R2, R3 and R4 are evaluated in parallel and, because each of them references SourceQuery, each of them causes SourceQuery to be evaluated independently. Yet another post by Ehren on the forum here explains this in detail.

The Power Query engine does do some caching in a ‘persistent cache’ that can be shared between queries and which, in some situations, stores the data requested from some types of data source on disk – it will store the data returned by Web.Contents, for example, but not File.Contents. The key statements about the persistent cache from this thread relevant to this particular problem are:

Power Query’s persistent cache (which stores data on disk during a particular refresh) is updated via a background thread. And separate evaluations (i.e. separate Microsoft.Mashup.Container.*.exe processes) running at the same time are not coordinated; when evaluation A is accessing the persistent cache (including updating it), this doesn’t block evaluation B from accessing the cache. This means that even when using a shared persistent cache, PQ can potentially end up requesting the same data twice. It depends on the timing of the various requests.

and

…you can reduce the number of evaluations happening at any given time (and thus the likelihood of timing-related cache misses, or additional unwanted requests) by doing the following:

  • Disabling background analysis (which pre-calculates the PQ Editor previews behind the scenes)
  • Disabling the Data Privacy Firewall (which does its own evaluations during refresh that could potentially cause duplicate requests, and also only consume a subset of the data in certain cases)

In PBIDesktop, you can also reduce the likelihood of timing-related cache misses by disabling parallel loading. (In Excel, the loading of multiple queries is always sequential)

As a result, with the “Enable Parallel Loading Of Tables” setting turned off, so that the queries above are evaluated in series, Fiddler shows the web service is called only once. What’s more, Profiler shows that only one of the queries out of R1, R2, R3 and R4 takes just over five seconds while the other three, executed after it, are almost instant.

This must be because, when the queries are executed in series, the first query refreshes and data gets loaded into Power Query’s persistent cache. After that, even though the three subsequent queries also cause SourceQuery to be evaluated, each time SourceQuery is evaluated it can reuse the data stored in the persistent cache.

In contrast, with the four queries refreshed in parallel, each evaluation of SourceQuery takes place before the persistent cache has been populated and therefore the web service is called four times. This is slower, but in this case not much slower, than the scenario with the four queries executed in series – each query evaluation takes just over five seconds but remember that they are being executed in parallel so the overall duration is not much more than five seconds. In other cases the difference between parallel and sequential query execution could be a lot larger.

Now for the bad news: the “Enable Parallel Loading Of Tables” setting only works in Power BI Desktop, and there is no equivalent setting in the Power BI service. When a dataset is refreshed in the service, as far as I can tell the queries in the dataset are always evaluated in parallel. In Excel, as Ehren says, all queries are executed sequentially.

Is it possible to make sure the web service is called only once with all the queries executed in parallel? Yes, but not in a completely reliable way. The first thing to say here is that Table.Buffer and related buffering functions are not, as far as I understand it, useful here: they buffer data in memory within a single chain of execution, and in this case we have four separate chains of execution for R1, R2, R3 and R4. Instead what I have found that works is inserting a delay that gives the persistent cache time to be populated. If you keep the SourceQuery and R1 queries the same, and then alter the M code for R2, R3 and R4 so they wait for ten seconds (using Function.InvokeAfter) before returning as follows:

let
    Source = Function.InvokeAfter(
        ()=>SourceQuery, 
        #duration(0,0,0,10)
        )
in
    Source

 

…then the web service is only called once, because by the time R2, R3 and R4 evaluate SourceQuery the persistent cache has already been populated. In this particular case this approach results in the slowest overall refresh times, though, and it only works because I can be sure that the call to the web service takes five seconds.

There’s one last thing to say here: wouldn’t it be nice if we could ensure that the web service was only called once, and all subsequent queries got their data from a persisted copy of the data? As Matthew Roche points out here, this is exactly what dataflows allow you to do. Taking SourceQuery and turning it into an entity in a dataflow would result in a single call to the web service when the entity is refreshed, the data from the web service being persisted in the dataflow. You would then just need to have the queries R1, R2, R3 and R4 in your dataset, change them so they get their data from the entity in the dataflow, and as a result they would use the persisted copy of the data in the dataflow and would not call the web service. Another reason to use dataflows!

[I hope all the information in this post is correct  – it is based on Ehren’s forum posts and my own observations alone. If anyone from the Power Query dev team is reading this and spots an error, please let me know]

Power BI Sentinel: Backup, Documentation, Change Tracking And Lineage Tracking For Power BI

A few weeks ago at SQLBits I had a demo of a very interesting new tool for Power BI users called Power BI Sentinel. The website, with all the details, is here:

https://www.powerbisentinel.com

It helps solve several problems that everyone managing a Power BI deployment has to deal with. It can:

  • Back up reports and datasets (as .pbix files) direct from your App Workspaces to Azure Blob storage on a schedule, so you are able to access earlier versions and roll back if you need to.
  • Generate documentation on your datasets, including the DAX calculations used.
  • Identify what visuals and filters have changed in different versions of a report, and when those changes took place.
  • Track which data sources are used by which datasets and, in turn, which reports use those datasets.

image

It’s still a very new tool and is adding functionality all the time; there is also a lot of functionality that you’d want from a tool like this that Microsoft hasn’t built the APIs to support yet. It’s definitely worth checking out, though, and (in my opinion) it’s quite reasonably priced. I dare say Microsoft will build some of this functionality into Power BI at some point, but I don’t know exactly what it will deliver or when that will happen.

Full disclosure: I’ve known the owners of the company for a long time through their involvement with the UK SQL Server and Power BI community, and I was given a free licence for Power BI Sentinel by them.

Defining Relationships Between Entities In The Common Data Model To Automatically Create Relationships In Power BI

Following on from my last post on attaching manually-created Common Data Model folders in Power BI dataflows, I was asked whether defining relationships between entities in the model.json file of the CDM folder results in relationships being created between tables in a Power BI dataset. I’ve just tested it, and I can confirm it does.

Consider a CDM folder that contains two entities, Sales and Fruit. The Sales entity contains monthly sales data:

image

…and the Fruit entity contains a list of fruit sold:

image

Now – and this is important – let’s say you have defined the entities in the model.json so that the Sales entity has attributes Month, Product and Sales:

image

…and the Fruit entity has a single attribute called Fruit:

image

Because the Sales entity and the Fruit entity use different names for the attributes that contain the names of the fruit sold, when the two entities are loaded into Power BI no relationships are created between the two tables. There’s no way that Power BI can guess that there should be a relationship between the Product and Fruit columns based on these names.

However, if you add a relationship definition (as documented here) to the model.json file like so:

image
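
A relationship definition of this kind in model.json looks roughly like the following. This is a sketch based on the CDM folder metadata format, using the entity and attribute names from this example; check the documentation linked above for the exact schema:

"relationships": [
  {
    "$type": "SingleKeyRelationship",
    "fromAttribute": {
      "entityName": "Sales",
      "attributeName": "Product"
    },
    "toAttribute": {
      "entityName": "Fruit",
      "attributeName": "Fruit"
    }
  }
]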

Then, when you load the two entities as tables in the same Power BI dataset, you get a relationship created automatically:

image

As far as I can see it does this by adding a key definition to the Sales table, in a similar way to how the Table.AddKey M function or the Remove Duplicates transformation does, as I blogged here.
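
For reference, here’s a minimal sketch of what marking a key looks like in M with Table.AddKey – a generic illustration rather than the exact key definition Power BI generates in this case:

let
    Source = #table({"Fruit"}, {{"Apples"}, {"Oranges"}, {"Pears"}}),
    //Mark the Fruit column as a primary key on the table
    WithKey = Table.AddKey(Source, {"Fruit"}, true)
in
    WithKey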

Now, if we could add DAX calculations (calculated columns and especially measures) to the definition of the model.json file, so they were automatically created when entities were imported from a dataflow, that would be really cool. I don’t see why this would not be possible: you could store all kinds of useful information – such as the DAX for the calculations – in the metadata record of an M query loaded into a dataset (you can access this yourself using the Value.Metadata M function), and the Power BI engine could read it when the table is loaded and create the calculations from it.
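
For example, here’s a minimal sketch of how a metadata record can be attached to a table value in M and read back with Value.Metadata; the DaxMeasures field name is made up purely for illustration:

let
    Source = #table({"Month", "Sales"}, {{"Jan", 100}, {"Feb", 150}}),
    //Attach a metadata record to the table value using the meta operator;
    //the DaxMeasures field is a hypothetical place to store a measure definition
    WithMetadata = Source meta [DaxMeasures = "Total Sales = SUM(Sales[Sales])"],
    //Value.Metadata returns the metadata record attached to a value
    MetadataRecord = Value.Metadata(WithMetadata)
in
    MetadataRecord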
