Troubleshooting Power Query performance issues in Power BI and Excel can be difficult because it’s a bit of a black box: there’s nothing in the UI to tell you what’s going on inside the Power Query engine and the diagnostic logs are very difficult to interpret. With relational data source like SQL Server you can use tools like SQL Server Profiler to see the queries that are being run by Power Query, and I blogged recently about using Fiddler to troubleshoot OData performance issues; but what about file-based data sources, which often present the most challenges regarding performance?
Process Monitor, a free tool from Microsoft, allows you to monitor file system activity in real-time and even having spent a limited amount of time using it I can already tell that it can provide a lot of information to help identify performance issues with file-based data sources. Take, for example, the scenario I described in my recent post on improving the performance of merge operations. In that post (which I suggest you read before you carry on) I mentioned that it looked as though the Power Query engine was reading data from one of the source files multiple times and Process Monitor confirms that this indeed the case.
Disclaimer: I am not going to pretend to be an expert on Process Monitor, the Windows file system or the internals of Power Query. The rest of this post contains the results of my experiments and the conclusions I have drawn from them, which may or may not be correct!
There are a lot of resources online describing Process Monitor (I also took a look at this book, which is very helpful, as is the help file) and it’s also fairly intuitive, so I’m not going to attempt to explain the basics of how to use it, but with a bit of trial-and-error I set up a simple filter consisting of:
- Operation is CreateFile
- Operation is CloseFile
- Path is <the full path of the main csv file used as the data source>
…to watch what happened when the unoptimised query refreshed:
Some things to note about the output shown above:
- I had the “Enable data preview to download in the background” option turned off just to make sure I only saw activity relating to the refresh
- For the Power Query engine it seems as though the Process Name is Microsoft.MashupContainer.NetFX40.exe.
- There are obvious pairs of CreateFile/CloseFile operations
- Some of these pairs, where the detail for the CreateFile operation is “Desired Access: Read Attributes”, seem to be for reading file metadata and are very fast
- The pairs where the CreateFile operation has a detail of “Desired Access: Generic Read”, seem to be where the contents of the file are read (the first of these pairs is highlighted in the screenshot). The Relative Time column shows the time elapsed since Process Monitor started capturing events, and the difference between the Relative Time values for the CreateFile and CloseFile operations in these pairs is around 5-10 seconds.
- It looks like the Power Query engine reads some or all of the data from the csv file six times in total. This tallies roughly with the values shown in the Power BI UI for the amount of data read from disk, although it does not tally with my guess that it was reading the data from the file once for each of the five rows in the output query.
- The difference in the Relative Time values of the first and last events is around 45 seconds; the query itself takes around 55 seconds to run.
With the optimisation described in the blog post (ie doing a “Remove Duplicates” on one of the columns) in place, Process Monitor shows the following instead:
As you can see, it looks like Power Query now only reads from the file twice and the difference in time between the first and last event is now only around 8 seconds.
This is clearly useful: we can now see when Power Query opens a file, how often it opens a file, and how long it holds the file open for each time. It also raises a number of other questions which I can’t answer properly yet:
- Why does Power Query need to open a file multiple times? Actually this isn’t a surprise to anyone who has spent a lot of time with Power Query and has a vague understanding of how it works, but it seems fair to say that the unoptimised query in this example was slow because Power Query had to open and read from the file multiple times.
- What is Power Query doing exactly when it has the file open? Is it reading some or all of the data in the file? What is it doing for the rest of the time?
- Are some file formats inherently slower than others as far as Power Query is concerned? It seems so: for example Excel files seem to be a lot slower than csv files.
So yet more research is required… but still very interesting, don’t you think? If I’ve made any mistakes here, or if you use Process Monitor to see what happens when you run your queries and you find something else useful, please leave a comment below.