Using The “Skip Test Connection” Option For Power BI Data Sources To Avoid Scheduled Refresh Failures

One of the most popular blog posts I have written in recent years is this post on how the RelativePath and Query options for the Web.Contents() M function can be used to allow datasets that use web services as data sources to refresh successfully in the Power BI Service. It’s very frustrating when you write some complex M code in Power BI Desktop and find that your dataset can’t be refreshed after it has been published! While, at the time of writing, this problem is still present, yesterday’s announcement of the new “Skip Test Connection” option for data sources in the Power BI Service does mean there are more scenarios where your datasets can be made to refresh.

As I mentioned in that original post, the trick of using the RelativePath and Query options with Web.Contents() only works if the first parameter passed to Web.Contents() is a url that itself can be called on its own without an error. For example, the following expression:

// This is equivalent to calling http://www.myservice.com/hello?q=SearchTerm
Web.Contents(
	"http://www.myservice.com",
	[RelativePath="hello",
	 Query=[q="SearchTerm"]
	])

…will only refresh successfully in the Power BI Service if http://www.myservice.com can be reached as a url on its own without returning an error. I believe this is because this url is used when the Power BI Service is preparing to run the query, to check whether the credentials it has stored are valid. As that blog post explains, there is a workaround where dummy query parameter values can be used in the first parameter (sketched below), but this won’t work in all cases. Instead, you can now turn on the “Skip Test Connection” option on the data source in the Power BI Service and the dataset will refresh even if http://www.myservice.com on its own returns an error.
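For reference, here’s a minimal sketch of that dummy-parameter workaround – the url and parameter names are purely illustrative. The hard-coded dummy value for q means the url in the first parameter can be called successfully on its own, while the value passed in the Query option is the one used when the query actually runs (as described in the original post):

Web.Contents(
	"http://www.myservice.com/hello?q=dummyvalue",
	[Query=[q="SearchTerm"]]
	)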

Similarly, this new option helps when you are using API keys passed in through query parameters for authentication. In Power BI Desktop and Excel Get&Transform/Power Query, you have the option of using the Web API authentication type to store sensitive API keys that need to be passed through url parameters securely, rather than embedding them in your M code. Unfortunately this is not supported in the Power BI Service yet and (as I found out on a recent consultancy engagement) there are other complications with using Anonymous authentication. For example, say you need to call a web service with an M expression something like this:

Web.Contents("http://www.myservice.com?key=password123")

…where the key query parameter contains an API key used for authentication. If you set this data source to use Anonymous authentication in Power BI Desktop or Excel it will refresh with no problems. However, after you publish to the Power BI Service you will not be able to refresh your dataset unless you turn on the “Skip Test Connection” option, because the Power BI Service apparently strips out any query parameters from the url when it is testing the connection. In this case, if it strips out the key parameter, then obviously the call will fail because the means of authentication has been removed.
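Incidentally, the same request can be written so that the key is passed via the Query option rather than being embedded in the url, consistent with the first example above (the url and key here are still just placeholders). Bear in mind that, as described earlier, the connection test only uses the url in the first parameter, so “Skip Test Connection” may still be needed if that url cannot be reached on its own:

Web.Contents(
	"http://www.myservice.com",
	[Query=[key="password123"]]
	)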

This new option does not mean every dataset can now be refreshed. As far as I can see, if you dynamically generate a url for use with Web.Contents() and do not use the RelativePath and/or Query options, you will not be able to refresh your dataset because the Power BI Service will not be able to analyse your code to discover what urls are present in it. I suspect that if you see the “Some data sources may not be listed because of hand-authored queries” warning in the Data Source Settings dialog in Power BI Desktop (which I blogged about here) then you will not be able to refresh your dataset in the Power BI Service, but I need to do more testing to be sure.

As always, if you test this yourself please let me know about any interesting things you find by leaving a comment!

Power Query Memory Usage, Dataflow Container Size And Refresh Performance

When Power BI dataflows were officially released a few weeks ago there was a new setting for Premium capacities mentioned briefly in the announcement blog post: Container Size.

The blog post only says the following:

We’re introducing a new dataflow workload on premium capacity settings, enabling you to optimize dataflow workload performance for processing more complex, compute-heavy dataflows. This setting is available in the Capacity Admin portal, Dataflow workload settings.

…which does not tell you much at all. Pedro Fernandes contacted me to see if I knew more and, because I didn’t, I started investigating. This resulted in me learning lots of new information about how the Power Query engine in Power BI Desktop and Excel uses memory, how things are different in the Power BI service, and how all of this can have an impact on query refresh performance.

If you’ve read this blog post, and this related thread on the Power Query forum, you’ll know that when a Power Query query is evaluated the work is done by a Microsoft.Mashup.Container process, visible in tools such as Task Manager and Resource Monitor. A single refresh operation in Excel or Power BI Desktop might result in multiple evaluations of multiple queries for different reasons, so it’s not uncommon to see multiple Microsoft.Mashup.Container processes.

[Screenshot: multiple Microsoft.Mashup.Container processes running during a single refresh]

In another thread from the Power Query forum, Curt Hagenlocher of the dev team explains that there is a 256MB limit on the amount of physical RAM that each Microsoft.Mashup.Container process can use, although there is no limit on the amount of virtual memory that can be used. The thread is about how using Table.Buffer can be bad for refresh performance, but the details are more widely applicable. Here are the highlights:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.
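To make this concrete, here’s a minimal sketch – the file path and column names are invented for illustration – of the kind of query where Table.Buffer might help: the buffered table is referenced twice, so buffering avoids enumerating the csv file a second time, at the cost of holding the whole table in the container’s limited memory.

let
    // Read and type the source data (file path and column names are hypothetical)
    Source = Csv.Document(File.Contents("c:\data\sales.csv"), [Delimiter=","]),
    Promoted = Table.PromoteHeaders(Source),
    Typed = Table.TransformColumnTypes(Promoted, {{"Sales", type number}}),
    // Buffer because the table is enumerated more than once below
    Buffered = Table.Buffer(Typed),
    Grouped = Table.Group(Buffered, {"Product"}, {{"Total", each List.Sum([Sales]), type number}}),
    // Second reference to Buffered: without the buffer this could enumerate the csv again
    Joined = Table.NestedJoin(Buffered, {"Product"}, Grouped, {"Product"}, "Totals", JoinKind.LeftOuter)
in
    Joined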

 

The 256MB limit is also mentioned briefly on this thread.

Here’s a screenshot from Resource Monitor showing this: the Microsoft.Mashup.Container process in this case is evaluating a query that reads data from a large (900,000-ish row) csv file and does a Pivot. While the Working Set value has peaked at around the 256MB limit, you can see that the Commit value is much higher, so paging must be taking place.

[Screenshot: Resource Monitor showing the Microsoft.Mashup.Container process with its Working Set peaking at around 256MB and a much higher Commit value]

Currently there is no way to change this 256MB limit in Power BI Desktop or Excel, although someone has already posted a suggestion on the Ideas site asking for the ability to change it. How much of an impact does this actually have on refresh performance, though? Without the ability to change this setting it’s hard to say, but I suspect it could be significant and that a lot of Power Query performance problems could be explained by this behaviour.

The situation is different in the Power BI service, where I understand there is a limit on the overall amount of memory that a single Power Query query evaluation can use. This makes a lot of sense in the context of Power BI Pro and shared capacity because Microsoft could not allow one user to run lots of complex, expensive Power Query queries that might affect other users inside or outside the same tenant. With Power BI Premium, which gives you your own dedicated capacity in the Power BI service, there is no chance that anything you do will affect other tenants and so Microsoft gives you more control over how resources are used. As a result, the new Container Size setting for a dataflow in a Premium capacity lets you configure the amount of memory that can be used for a single entity refresh within a dataflow – and refreshing a single entity in a dataflow is, as far as I understand it, the equivalent of what a Microsoft.Mashup.Container process does on the desktop.

I did some (not very scientific) testing and it looks like increasing the Container Size setting can have a noticeable impact on the performance of memory-intensive queries. Using the technique I blogged about here I measured the execution time of the query mentioned above that does a Pivot on data from a large, local csv file in Power BI Desktop: it took 98 seconds on my laptop. I then used a dataflow to load the source data from the csv file into an entity in a dataflow on an A4 capacity, without making any changes. This took 132 seconds; you can get the time taken for a dataflow refresh by clicking on the Refresh History option, as Matthew Roche shows here. I then created a computed entity that used this new entity as its source and which did the same Pivot operation as the original query on the desktop. The following table shows the time taken to refresh this computed entity in the A4 capacity with different Container Size settings:

Container Size (MB)    Refresh Time (seconds)
700                    61
2000                   57
5000                   52

700MB is the default setting for Container Size and the smallest value that you can use; I guess the maximum value you can set will depend on the size of the capacity you’re using. There are two conclusions that I think you can draw from these results:

  • Even with the default setting for Container Size, it was faster for the computed entity to read the data from the source entity (which, remember, stores its data in Azure Data Lake Gen2 storage) and do the Pivot in the Power BI service than it was for Power Query in Power BI Desktop to read the data from the csv file and do the same Pivot operation on my laptop.
  • Increasing the Container Size setting reduced refresh time quite significantly.

So, as the blog post I referenced at the very beginning of this post states, if you are doing memory-intensive operations such as group bys, sorts, pivots, unpivots and joins against non-foldable data sources in a dataflow, and if that dataflow is on a Premium capacity, then increasing the Container Size property is probably a good idea because it may reduce refresh times. If you can reproduce this on your own Premium capacities please let me know by leaving a comment – I would be very interested to hear about your experiences.

[Thanks to Curt Hagenlocher and Anton Fritz for providing information for this blog post]

BI Survey 19

The BI Survey is the largest annual survey of BI users in the world, and every year I get a free copy of the results (which are always very interesting) in return for publicising it here on my blog. If you take part, you’ll also get a summary of the results and be entered into a draw for some Amazon vouchers. Here’s the link to take the survey:

https://www.efs-survey.com/uc/BARC_GmbH/f42f/?a=101

Power BI did very well last year and I’m sure it will do even better this year!

Table.Buffer() Does Not Buffer Nested Values

Here’s yet another entry in the list of useful things I learned from Ehren von Lehe on the Power Query MSDN forum: Table.Buffer() does not buffer nested table, record or list values inside the cells of a table. From this thread:

Buffering is shallow. It will force the evaluation of any scalar cell values, but will leave non-scalar values (records, lists, tables, etc.) as-is.

It’s actually really difficult to come up with a simple demo query to prove this though (the Power Query engine is too clever about not evaluating things it doesn’t need for the final output of a query), but it’s fairly easy to understand the principle. Whenever you have an expression that returns a table something like this:

[Image: a table with scalar Name and Signature columns and a highlighted cell containing a nested table value]

…if you use Table.Buffer() on this table it will only buffer the scalar values (in this case the text values in the Name and Signature columns). The nested table values, as in the highlighted cell, will not be buffered in memory and if you try to access the contents it may result in another call back to the underlying data source.
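Here’s a rough sketch of the kind of query where this matters – the server, database, table and column names are all invented for illustration. Table.Buffer forces evaluation of the scalar Customer values, but the nested tables in the Rows column are left as-is, so expanding them later may still result in calls back to the data source:

let
    // Hypothetical SQL Server source
    Source = Sql.Database("myserver", "mydb"),
    Sales = Source{[Schema="dbo", Item="Sales"]}[Data],
    // Each group keeps its rows as a nested table value
    Grouped = Table.Group(Sales, {"Customer"}, {{"Rows", each _, type table}}),
    // Buffering is shallow: the scalar Customer values are buffered,
    // but the nested tables in the Rows column are not
    Buffered = Table.Buffer(Grouped)
in
    Buffered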

Comparing The Performance Of Reading Data From Files With File.Contents And Web.Contents In Power Query And Power BI

In my last post I mentioned the Power Query engine’s persistent cache, which in some scenarios caches the data read from a data source when a query is refreshed. Another important nugget of information that Ehren von Lehe of the Power Query dev team mentioned in a post on the Power Query MSDN forum recently is the fact that if you use File.Contents to get data from a file then the persistent cache is not used, but if you use Web.Contents to get data from the same file then the persistent cache is used. I guess the thinking here is that there is no point creating an on-disk cache containing the contents of a file that is already on disk.

Using Process Monitor (see here and here for more information on how to do this) to view how much data is read from disk when a query is run, it is possible to see this in action. Here’s what happens when a slow query that reads data from a 150MB csv file is refreshed. This particular slow query results in five reads of the csv file – it’s more or less the same scenario from my second Process Monitor blog post here, with the query itself described in a lot of detail here. Here’s a graph of the data captured by Process Monitor, showing time in seconds on the X axis and the amount of data read in MB on the Y axis:

[Graph: data read from disk over time, showing the full contents of the csv file being read five times]

Since this query uses File.Contents to get the data from the csv file, the persistent cache is not used; as you can see, the graph shows clearly that the full contents of the csv file are read five times.

The same query altered to use Web.Contents shows just two full reads:

[Graph: data read from disk over time, showing the csv file being read in full only twice]

I have no idea why the file is read twice rather than once, but it’s definitely different behaviour to the version that uses File.Contents.

As far as I can see it is possible to replace File.Contents with Web.Contents in every case. So, if you have the following expression:

File.Contents("c:\users\myuser\Desktop\file.txt")

You can just replace it with:

Web.Contents("c:\users\myuser\Desktop\file.txt")

Which one is faster though? Just because a query reads data from disk more often does not necessarily mean that it will be slower. In the above scenario, with the csv file stored on my local hard drive, the Web.Contents version of the query refreshes in 18 seconds while the File.Contents version refreshes in 14 seconds. Replacing the csv file with an Excel file that contains the same data (remember that Excel files are a lot slower than csv files to read data from, as I showed here) results in the version of the query that uses File.Contents running in 205 seconds and the version that uses Web.Contents running in 297 seconds. So it looks like, in most cases, File.Contents is the right choice when reading data from a file (as you would hope).

However, when using the same csv file stored on a network file share, the Web.Contents version takes 23 seconds while the File.Contents version takes 25 seconds. So maybe if you are dealing with files that are stored remotely over a slow connection it might be worth replacing File.Contents with Web.Contents to see if you get any performance benefits. There may be other situations where Web.Contents is the faster choice too. If you test this and see a difference, let me know by leaving a comment!

UPDATE: please also read Curt Hagenlocher’s comment below – Web.Contents may be changed in the future so that it only works with http/https urls.
