Chris Webb's BI Blog: Testing The Performance Impact Of AzureStorage.DataLake() Options On Power BI Refresh Performance

Continuing my series on tuning the performance of importing data from ADLSgen2 into Power BI, in this post I’m going to look at the performance impact of setting some of the various options in the second parameter of the AzureStorage.DataLake() M function. In the last post in this series I showed how setting the HierarchicalNavigation option can improve refresh performance, but what about BlockSize, RequestSize or ConcurrentRequests?

Here’s what the documentation says about these options:

BlockSize : The number of bytes to read before waiting on the data consumer. The default value is 4 MB.
RequestSize : The number of bytes to try to read in a single HTTP request to the server. The default value is 4 MB.
ConcurrentRequests : The ConcurrentRequests option supports faster download of data by specifying the number of requests to be made in parallel, at the cost of memory utilization. The memory required is (ConcurrentRequest * RequestSize). The default value is 16.
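
To put the memory formula from the documentation in concrete terms, here is a minimal sketch in M that evaluates ConcurrentRequests * RequestSize for the default values and for the doubled values tested later in this post. The helper name EstimateMemoryMB is my own invention for illustration and is not part of the connector.

```powerquery
let
    // Hypothetical helper: applies the documented formula
    // (ConcurrentRequests * RequestSize) and converts bytes to MB
    EstimateMemoryMB = (concurrentRequests as number, requestSize as number) as number =>
        concurrentRequests * requestSize / (1024 * 1024),

    // Documented defaults: 16 concurrent requests of 4 MB each
    Defaults = EstimateMemoryMB(16, 4 * 1024 * 1024),

    // Doubled values: 32 concurrent requests of 8 MB (8388608 bytes) each
    Doubled = EstimateMemoryMB(32, 8 * 1024 * 1024)
in
    [Defaults = Defaults, Doubled = Doubled]
```

With the defaults this works out to 64 MB, and with both values doubled it works out to 256 MB, which is why lowering ConcurrentRequests is the obvious lever if memory is the concern.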

I tested various options using the same 8 million row set of CSV files I have used in my previous posts and the same queries generated by the From Folder source (see this post for more details – note that in this post I am not using Synapse Serverless, just loading direct from the files). Here’s an example of how these options can be set:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [ConcurrentRequests = 1]
)
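
Several options can be set in the same record. As a sketch, this is how the combined configuration in the last row of the results table below could be written, using the same placeholder URL as the example above:

```powerquery
AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [
    // Double the default of 16 parallel requests
    ConcurrentRequests = 32,
    // 8 MB, expressed in bytes (default is 4 MB)
    BlockSize = 8388608,
    // 8 MB, expressed in bytes (default is 4 MB)
    RequestSize = 8388608
  ]
)
```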

Here are the average dataset refresh times measured in the Power BI Service using Profiler:

Option	Average Refresh Time (seconds)
None set – defaults used	67
ConcurrentRequests=1	70
ConcurrentRequests=32	67
BlockSize=1	70
BlockSize=8388608 (8MB)	68
RequestSize=1	Error (see below)
RequestSize=8388608 (8MB)	68
ConcurrentRequests=32, BlockSize=8388608, RequestSize=8388608	67

From these results it looks like it’s possible to make performance slightly worse in some cases, but none of the configurations tested made performance better than the default settings.

There are two somewhat interesting things to note. First, this is pretty much what the developers told me to expect when I asked about these options a while ago. However, I was told that there may be some scenarios where reducing the value of ConcurrentRequests can be useful to reduce the memory overhead of a Power Query query, I guess to avoid paging on the Desktop (as discussed here) or memory errors in the Power BI Service.

Second, when I set RequestSize=1 (which means that each HTTP request was only allowed to return 1 byte of data, which is a pretty strange thing to want to do) I got the following error:

Expression.Error: The evaluation reached the allowed cache entry size limit. Try increasing the allowed cache size.

This reminds me I need to do some research into how the Power Query cache works in Power BI Desktop and write that up as a post…

Overall, no major revelations here, but sometimes it’s good to know what doesn’t make any difference as much as what does.

Update 3/3/2021: read this post to see the results of some testing I did which shows why changing these options didn’t have much impact on refresh performance.

4 thoughts on “Testing The Performance Impact Of AzureStorage.DataLake() Options On Power BI Refresh Performance”

Jerome says:

March 2, 2021 at 2:36 am

what about the azure side?
I’m sure you are at the maximum performance you can get there.

because the problem is more how fast Azure is, and its pretty bad if you dont pay the big price.
most of the IO in Azure is low, very low compared to a standard SSD.
you should compare the loading of the data from your local SSD.


Kyle Hale says:

March 2, 2021 at 5:44 am

Have you tried loading data from CSVs using the Query Acceleration option in Azure Data Lake?

https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-query-acceleration

Curious to see some actual performance numbers there, benchmarks are hard to come by …


Chris Webb says:

March 2, 2021 at 8:43 am

Not yet, but it’s on the (very long) list of things to test.
  
  
Pingback: Chris Webb's BI Blog: Measuring The Performance Of AzureStorage.DataLake() Using Power Query Query Diagnostics

Published by Chris Webb