Using Process Monitor To Troubleshoot Power Query Performance Issues With File Data Sources

Troubleshooting Power Query performance issues in Power BI and Excel can be difficult because it’s a bit of a black box: there’s nothing in the UI to tell you what’s going on inside the Power Query engine and the diagnostic logs are very difficult to interpret. With relational data source like SQL Server you can use tools like SQL Server Profiler to see the queries that are being run by Power Query, and I blogged recently about using Fiddler to troubleshoot OData performance issues; but what about file-based data sources, which often present the most challenges regarding performance?

Process Monitor, a free tool from Microsoft, allows you to monitor file system activity in real-time and even having spent a limited amount of time using it I can already tell that it can provide a lot of information to help identify performance issues with file-based data sources. Take, for example, the scenario I described in my recent post on improving the performance of merge operations. In that post (which I suggest you read before you carry on) I mentioned that it looked as though the Power Query engine was reading data from one of the source files multiple times and Process Monitor confirms that this indeed the case.

Disclaimer: I am not going to pretend to be an expert on Process Monitor, the Windows file system or the internals of Power Query. The rest of this post contains the results of my experiments and the conclusions I have drawn from them, which may or may not be correct!

There are a lot of resources online describing Process Monitor (I also took a look at this book, which is very helpful, as is the help file) and it’s also fairly intuitive, so I’m not going to attempt to explain the basics of how to use it, but with a bit of trial-and-error I set up a simple filter consisting of:

  • Operation is CreateFile
  • Operation is CloseFile
  • Path is <the full path of the main csv file used as the data source>

image

…to watch what happened when the unoptimised query refreshed:

image

Some things to note about the output shown above:

  • I had the “Enable data preview to download in the background” option turned off just to make sure I only saw activity relating to the refresh
  • For the Power Query engine it seems as though the Process Name is Microsoft.MashupContainer.NetFX40.exe.
  • There are obvious pairs of CreateFile/CloseFile operations
  • Some of these pairs, where the detail for the CreateFile operation is “Desired Access: Read Attributes”, seem to be for reading file metadata and are very fast
  • The pairs where the CreateFile operation has a detail of “Desired Access: Generic Read”, seem to be where the contents of the file are read (the first of these pairs is highlighted in the screenshot). The Relative Time column shows the time elapsed since Process Monitor started capturing events, and the difference between the Relative Time values for the CreateFile and CloseFile operations in these pairs is around 5-10 seconds.
  • It looks like the Power Query engine reads some or all of the data from the csv file six times in total. This tallies roughly with the values shown in the Power BI UI for the amount of data read from disk, although it does not tally with my guess that it was reading the data from the file once for each of the five rows in the output query.
  • The difference in the Relative Time values of the first and last events is around 45 seconds; the query itself takes around 55 seconds to run.

With the optimisation described in the blog post (ie doing a “Remove Duplicates” on one of the columns) in place, Process Monitor shows the following instead:

image

As you can see, it looks like Power Query now only reads from the file twice and the difference in time between the first and last event is now only around 8 seconds.

This is clearly useful: we can now see when Power Query opens a file, how often it opens a file, and how long it holds the file open for each time. It also raises a number of other questions which I can’t answer properly yet:

  • Why does Power Query need to open a file multiple times? Actually this isn’t a surprise to anyone who has spent a lot of time with Power Query and has a vague understanding of how it works, but it seems fair to say that the unoptimised query in this example was slow because Power Query had to open and read from the file multiple times.
  • What is Power Query doing exactly when it has the file open? Is it reading some or all of the data in the file? What is it doing for the rest of the time?
  • Are some file formats inherently slower than others as far as Power Query is concerned? It seems so: for example Excel files seem to be  a lot slower than csv files.

So yet more research is required… but still very interesting, don’t you think? If I’ve made any mistakes here, or if you use Process Monitor to see what happens when you run your queries and you find something else useful, please leave a comment below.

6 responses

  1. Thanks for the information.

    On Tue, Jul 17, 2018, 2:47 AM Chris Webb’s BI Blog wrote:

    > Chris Webb posted: “Troubleshooting Power Query performance issues in > Power BI and Excel can be difficult because it’s a bit of a black box: > there’s nothing in the UI to tell you what’s going on inside the Power > Query engine and the diagnostic logs are very difficult to inte” >

  2. Hi Chris.
    I think that there is at least two things involved:
    1) Power Query tries to use a kind of streaming semantic (or, may be, kind of folding?) with tables (which is not immutable in M) by default. So it can read a relatively big files in a parts.
    2) removing duplicates or adding keys or index or buffering stops this behavior and PQ then does not to try to re-read file

    I understand that my description is not cose to the correct answer, but if we can cast Curt there…, I think he knows the answer.

  3. Thank you Chris,
    This is an amazing tool you are giving us.
    With this tool and your few tips I was able to identify big issues in some queries.
    Please note that when your files lies on a network, you need to modify the filters like:
    – Operation is IRP_MJ_CREATE
    – Operation is IRP_MJ_CLOSE
    and also path needs to be full address:
    – Path Contains \\10.0.0.1\…\…\

  4. Pingback: Using Process Monitor To Find Out How Much Data Power Query Reads From A File « Chris Webb's BI Blog

  5. Pingback: Using Process Monitor To Troubleshoot Power Query Performance Issues With File Data Sources | Pardaan.com

  6. Pingback: Comparing The Performance Of CSV And Excel Data Sources In Power Query « Chris Webb's BI Blog

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: