In my last post I showed how changing the various options on the AzureStorage.DataLake() M function didn’t have much impact on dataset refresh performance in Power BI. I’ll admit I was slightly surprised by this, but it got me wondering why this was – and so I decided to do some tests to find out.
The answer can be found using Power Query’s query diagnostics functionality. Although you can’t use it to find out what happens when a dataset refresh takes place in the Power BI Service, you can use it to view requests to web services for refreshes in Power BI Desktop as I showed in this post. The Detailed diagnostic log query shows each request Power Query makes to get data from the ADLSgen2 API, the urls show the names of the files being accessed, and you can also see how long each request takes, the start and end time of each request and the amount of data read (the Content Length value in the response) amongst other things:
I wrote a Power Query query to extract all this useful information and put it in a more useful format, which can then be shown in Power BI. It’s fairly rough-and-ready but I turned it into an M function and posted the code here if you’d like to try it yourself – I haven’t done any serious testing on it though.
Here’s the data I captured for a refresh in Power BI Desktop that started at 10:55:42am yesterday and ended at 10:57:33am which took 111 seconds overall. I was using the default options for AzureStorage.DataLake() and this table only shows data for the GET requests to the ADLSgen2 API that returned data:
The main thing to notice here is that the total duration of all the requests was just 5.25 seconds – less than 5% of the overall refresh time – which explains why changing the options in AzureStorage.DataLake() didn’t make much difference to dataset refresh performance. Maybe if the files were larger, or there were more of them, changing the options would make a more noticeable impact. Of course there’s a lot more happening inside both the Power Query engine and the Analysis Services engine here beyond calling the web service to get the raw data. I also ran a Profiler trace while this refresh was running (see here for how to do this) and from the point of view of the Analysis Services engine it took 104 seconds to read the data from Power Query: the ExecuteSQL Profiler event took 4.5 seconds and the ReadData event took 99.5 seconds.
Conclusion: getting raw data from ADLSgen2 only represents a small part of the time taken to refresh a dataset that uses ADLSgen2 as a source, so any attempts to tune this may not have much impact on overall refresh times.