Testing The Performance Impact Of AzureStorage.DataLake() Options On Power BI Refresh Performance

Continuing my series on tuning the performance of importing data from ADLSgen2 into Power BI, in this post I’m going to look at the performance impact of setting some of the various options in the second parameter of the AzureStorage.DataLake() M function. In the last post in this series I showed how setting the HierarchicalNavigation option can improve refresh performance, but what about BlockSize, RequestSize or ConcurrentRequests?

Here’s what the documentation says about these options:

  • BlockSize : The number of bytes to read before waiting on the data consumer. The default value is 4 MB.
  • RequestSize : The number of bytes to try to read in a single HTTP request to the server. The default value is 4 MB.
  • ConcurrentRequests : The ConcurrentRequests option supports faster download of data by specifying the number of requests to be made in parallel, at the cost of memory utilization. The memory required is (ConcurrentRequest * RequestSize). The default value is 16.

Using the same 8 million row set of csv files I have used in my previous posts and the same queries generated by the From Folder source (see this post for more details – note that in this post I am not using Synapse Serverless, just loading direct from the files), I tested various options. Here’s an example of how these options can be set:

AzureStorage.DataLake(
  "https://xyz.dfs.core.windows.net/myfolder",
  [ConcurrentRequests = 1]
)

Here are the average dataset refresh times measured in the Power BI Service using Profiler:

OptionAverage Refresh Time (seconds)
None set – defaults used67
ConcurrentRequests=170
ConcurrentRequests=3267
BlockSize=170
BlockSize=8388608 (8MB)68
RequestSize=1Error (see below)
RequestSize=8388608 (8MB)68
ConcurrentRequests=32,
BlockSize=8388608,
RequestSize=8388608
67

From these results it looks like it’s possible to make performance slightly worse in some cases but none of the configurations tested made performance better than the default settings.

There are two somewhat interesting things to note. First, this is pretty much what the developers told me to expect when I asked about these options a while ago. However I was told that there may be some scenarios where reducing the value of ConcurrentRequests can be useful to reduce the memory overhead of a Power Query query – I guess to avoid paging on the Desktop (as discussed here) or memory errors in the Power BI Service.

Second, when I set RequestSize=1 (which means that each HTTP request was only allowed to return 1 byte of data, which is a pretty strange thing to want to do) I got the following error:

Expression.Error: The evaluation reached the allowed cache entry size limit. Try increasing the allowed cache size.

This reminds me I need to do some reasearch into how the Power Query cache works in Power BI Desktop and write that up as a post…

Overall, no major revelations here, but sometimes it’s good to know what doesn’t make any difference as much as what does.

Update 3/3/2021: read this post to see the results of some testing I did which shows why changing these options didn’t have much impact on refresh peformance.

4 responses

  1. what about the azure side?
    I’m sure you are at the maximum performance you can get there.

    because the problem is more how fast Azure is, and its pretty bad if you dont pay the big price.
    most of the IO in Azure is low, very low compared to a standard SSD.
    you should compare the loading of the data from your local SSD.

  2. Pingback: Chris Webb's BI Blog: Measuring The Performance Of AzureStorage.DataLake() Using Power Query Query Diagnostics Chris Webb's BI Blog

Leave a Reply to Jerome Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: