Can You Use Power Query In Power BI Or Excel To Write Data To A Data Source?

The short answer: no.

The long answer:

It is, technically, possible to use Power Query (in either Power BI or Excel) to write data to a data source; for example, here’s an old post I wrote showing how to run a UPDATE statement to update data in a table in SQL Server. However because the Power Query engine may evaluate a value or statement more than once when a query is executed, it is therefore not safe or supported to use it to try to create, insert, update, delete or otherwise change data in a data source – it should only be used to read data. You may see some functions listed in the M function reference that suggest otherwise but they aren’t supported either, and indeed may not work at all.

Thanks to Matt Masson of the dev team for providing the official position on this.

Comparing The Performance Of CSV And Excel Data Sources In Power Query

My posts from two weeks ago (see here and here) on using Process Monitor to troubleshoot the performance of Power Query queries made me wonder about another question: how does the performance of reading data from CSV files compare to the performance of reading data from Excel files? I think most experienced Power Query users in either Power BI or Excel know that Excel data sources perform pretty badly but I had never done any proper tests. I’m not going to pretend that this post is a definitive answer to this question (and once again, I would be interested to hear your experiences) but hopefully it will be thought-provoking.

To start off, I took the 153.6MB CSV file used in my last few posts and built a simple query that applied a filter on one text column, then removed all but three columns. The query ran in 9 seconds and using the technique from my last post I was able to draw the following graph from a Process Monitor log file showing how Power Query reads data from the file:

image

Nothing very surprising there. Then I opened the same CSV file in Excel and saved the data as an xlsx file with one worksheet and no tables or named ranges; the resulting file was only 80.6MB. Finally I created a duplicate of the first query and pointed it to the Excel file instead. The resulting query ran in 59 seconds – around 6 times slower! Here’s a comparison with the performance of this query with the first query:

image

The black line in the graph above is the amount of data read (actually the offset values showing where in the file the data is read from, which is the same thing as a running total when Power Query is reading all the data) from the Excel file; the green line is the amount of data read from the CSV file (the same data shown in the first graph above). A few things to mention:

  • Running Process Monitor while this second query was refreshing had a noticeable impact on its performance – in fact it was almost 20 seconds slower
  • The initial values of 80 million bytes seem to be where data is read from the end of the Excel file. Maybe this is Power Query reading some file metadata? Anyway, it seems as though it takes 5 seconds before it starts to read the data needed by the query.
  • There’s a plateau between the 10 and 20 second mark where not much is happening; this didn’t happen consistently and may have been connected to the fact that Process Monitor was running

In any case you don’t need to study this too hard to understand that the performance of reading data from an xlsx format Excel file is terrible compared to the performance of reading data from a CSV. So, if you have a choice between these two formats, for example when downloading data, it seems fair to say that you should always choose CSV over xlsx.

Using Process Monitor To Find Out How Much Data Power Query Reads From A File

This post is really just a quick follow-on from my post earlier this week on using Process Monitor to troubleshoot Power Query performance issues with file-based data sources, which I suggest you read before carrying on. I realised, after playing around with Process Monitor some more, that the ReadFile operation actually tells you how much data is being read from a file when a Power Query query is running. For example, here’s a sample of some of the ReadFile operations captured while running the unoptimised version of the query I talked about in my last post:

image

Since Process Monitor can export captured events to a CSV file, it’s pretty easy to load the events into Power BI, filter the events down to only the ReadFile operations, parse the Detail column to extract the Offset values (which I’m sure you can work out how to do if you’re reading a post like this), and then draw a graph showing how much data gets read from a file when a query is run. Here’s what the graph looks like for the unoptimised version of the query from my previous blog post, with relative time on the x axis and the amount of data read  in bytes on the y axis:

image

In that post I noted that there were six reads of the file – and while that’s clear from the graph above, it’s also possible to see that the first read does not read the whole contents of the file while the next five do (the file is 149MB). So maybe I was right that there is one complete read of the file for each row in the output query? What is that first, partial read for, I wonder?

Using Process Monitor To Troubleshoot Power Query Performance Issues With File Data Sources

Troubleshooting Power Query performance issues in Power BI and Excel can be difficult because it’s a bit of a black box: there’s nothing in the UI to tell you what’s going on inside the Power Query engine and the diagnostic logs are very difficult to interpret. With relational data source like SQL Server you can use tools like SQL Server Profiler to see the queries that are being run by Power Query, and I blogged recently about using Fiddler to troubleshoot OData performance issues; but what about file-based data sources, which often present the most challenges regarding performance?

Process Monitor, a free tool from Microsoft, allows you to monitor file system activity in real-time and even having spent a limited amount of time using it I can already tell that it can provide a lot of information to help identify performance issues with file-based data sources. Take, for example, the scenario I described in my recent post on improving the performance of merge operations. In that post (which I suggest you read before you carry on) I mentioned that it looked as though the Power Query engine was reading data from one of the source files multiple times and Process Monitor confirms that this indeed the case.

Disclaimer: I am not going to pretend to be an expert on Process Monitor, the Windows file system or the internals of Power Query. The rest of this post contains the results of my experiments and the conclusions I have drawn from them, which may or may not be correct!

There are a lot of resources online describing Process Monitor (I also took a look at this book, which is very helpful, as is the help file) and it’s also fairly intuitive, so I’m not going to attempt to explain the basics of how to use it, but with a bit of trial-and-error I set up a simple filter consisting of:

  • Operation is CreateFile
  • Operation is CloseFile
  • Path is <the full path of the main csv file used as the data source>

image

…to watch what happened when the unoptimised query refreshed:

image

Some things to note about the output shown above:

  • I had the “Enable data preview to download in the background” option turned off just to make sure I only saw activity relating to the refresh
  • For the Power Query engine it seems as though the Process Name is Microsoft.MashupContainer.NetFX40.exe.
  • There are obvious pairs of CreateFile/CloseFile operations
  • Some of these pairs, where the detail for the CreateFile operation is “Desired Access: Read Attributes”, seem to be for reading file metadata and are very fast
  • The pairs where the CreateFile operation has a detail of “Desired Access: Generic Read”, seem to be where the contents of the file are read (the first of these pairs is highlighted in the screenshot). The Relative Time column shows the time elapsed since Process Monitor started capturing events, and the difference between the Relative Time values for the CreateFile and CloseFile operations in these pairs is around 5-10 seconds.
  • It looks like the Power Query engine reads some or all of the data from the csv file six times in total. This tallies roughly with the values shown in the Power BI UI for the amount of data read from disk, although it does not tally with my guess that it was reading the data from the file once for each of the five rows in the output query.
  • The difference in the Relative Time values of the first and last events is around 45 seconds; the query itself takes around 55 seconds to run.

With the optimisation described in the blog post (ie doing a “Remove Duplicates” on one of the columns) in place, Process Monitor shows the following instead:

image

As you can see, it looks like Power Query now only reads from the file twice and the difference in time between the first and last event is now only around 8 seconds.

This is clearly useful: we can now see when Power Query opens a file, how often it opens a file, and how long it holds the file open for each time. It also raises a number of other questions which I can’t answer properly yet:

  • Why does Power Query need to open a file multiple times? Actually this isn’t a surprise to anyone who has spent a lot of time with Power Query and has a vague understanding of how it works, but it seems fair to say that the unoptimised query in this example was slow because Power Query had to open and read from the file multiple times.
  • What is Power Query doing exactly when it has the file open? Is it reading some or all of the data in the file? What is it doing for the rest of the time?
  • Are some file formats inherently slower than others as far as Power Query is concerned? It seems so: for example Excel files seem to be  a lot slower than csv files.

So yet more research is required… but still very interesting, don’t you think? If I’ve made any mistakes here, or if you use Process Monitor to see what happens when you run your queries and you find something else useful, please leave a comment below.

Combining Data From Multiple Worksheets In The Same Excel Workbook Using Power BI

It’s very common that you need to combine data from multiple worksheets in the same Excel workbook when you’re using Power BI or Power Query/Get&Transform in Excel. Indeed a lot of people have blogged about how to solve this problem, but none of the solutions I’ve found on the internet work in more complex scenarios when the data on each sheet needs some kind of transformation before it can be combined. I was asked to explain how to do this recently while teaching a Power BI class, so in this blog post I’m going to walk through a worked example and point out a few issues that might trip up even experienced Power BI users.

First of all, the source data. Let’s say you have an Excel workbook with four worksheets: Q1, Q2, Q3 and Q4. On each worksheet is some sales data for the three months in each quarter; for example the Q1 worksheet looks like this:

image

…the Q2 worksheet looks like this:

image

…and so on. The required output for Power BI should be a table that looks like this:

image

Now most of the blog posts that describe this problem, such as Ken Puls’s post here, assume each worksheet has a table with the same column names on it. If each sheet has the same columns, this means you can just connect to the Excel workbook and get a table containing the contents (Miguel Escobar has a great post describing how to do this here) and then click the Expand/Aggregate button:

image

However in this particular case it doesn’t solve the problem, because we get this:

image

Aha, you may say, we have to transform the data before we can combine it and so we need to create a function and call it for every worksheet – the technique I’ve already blogged about here. And yes, that is basically what needs to happen, but the devil’s in the detail.

Here’s the solution, step-by-step:

Step 1: Get a table with all the worksheets listed

In Power BI connect to your Excel file as normal, then in the Navigator pane right-click on the name of the Excel workbook and select Edit rather than selecting any of the individual worksheets:

image

The result will be a table that looks something like this:

image

If you need to, filter out any rows that do not contain “Sheet” in the Kind column and also filter out any worksheets that you don’t want to combine data from.

Step 2: Create your template query

Duplicate the query above and call the new query Template.

Now, in the Template query, select one of the worksheets to use to build the query whose logic will be applied to all the other worksheets, and filter the table above so it only contains the row for that worksheet. In this case I’m using the worksheet called Q1:

image

Then – and this is important – remove all the other columns in the table except the Data column:

image

Doing this changes the M code generated for the next thing you’ll do; removing all these columns changes the way the row is referenced (see the section on “The effect of primary keys” in this post) and makes sure the name of the worksheet won’t be hard-coded anywhere.

After that click the Table link inside the cell, and you’ll see the contents of the worksheet:

image

There will probably be a Changed Type step in the query that sets the data types for each of the columns, and you will need to delete it:

image

You can now perform any other transformations you need on this query, but you will need to avoid any transformations that generate M code referring to any columns on the original worksheet that aren’t present on other worksheets. Remember, these transformations will need to be applied to the other worksheets and they will fail if they refer to columns that aren’t present – this is why you had to delete the Changed Type step earlier, because it sets the types on the January, February and March columns, and you’ll probably need to delete any other Changed Type steps that are created elsewhere in the query. Open up the Advanced Editor and check the M code for the whole query just to be sure.

In this case all I need to do is unpivot the month columns  by selecting the Product column and using the Unpivot Other Columns button on the Transform tab, and then renaming the columns appropriately:

image

Step 3: Create a function

Next you need to create a new parameter by clicking the Manage Parameters/New Parameter button, call the parameter Worksheet, set the data type to text and have it return the name of the worksheet you chose in the previous step:

image

Now, go back to the Template query, find the step called Filtered Rows towards the beginning where you filtered down to a single worksheet, and click the gear icon next to the step to edit it:

image

Then, edit the step so it uses the value returned by the parameter to filter by instead of the hard-coded value you entered earlier. To do this, click on the button shown below, select Parameter and then select the Worksheet parameter in the next dropdown box along:

image

Finally, go to the Queries pane on the left-hand side of the screen and right-click on the Template query and select Create Function..

image

You’ll be prompted to give the new function a name; call it GetData:

image

Step 4: Invoke the function and combine the data

Finally, go back to the duplicate copy of the original query created at the beginning of step 2. Then go to the Add Column tab on the ribbon and click the Invoke Custom Function button and invoke the GetData function, passing in the contents of the Name column to the function’s only parameter:

image

Last of all, click the Expand/Aggregate button on the new column and expand the nested tables:

image

After removing any unnecessary columns, you’ll see the data from all the worksheets combined into a single table as desired:

image

Don’t forget to set the data types on each column.

You can download the Excel workbook used in this post here and the sample Power BI Desktop file here.

More Details On Creating Tables In Power BI/Power Query M Code Using #table()

About two years ago I wrote a blog post describing how the #table M function can be used to generate tables, but in that post I only covered the functionality I used regularly – namely using #table with a list of column names or a table type in the first parameter. However there two other variations on #table that I have used recently that I thought were worth pointing out.

For example, if you need to generate a table with a set number of columns but you don’t care what the columns are called, you can use an integer in the first parameter to get a table with that number of columns. The following expression returns a table with four columns of data type Any called Column1, Column2, Column3 and Column3, and no rows:

#table(4,{})

image

Also, if you have a list of lists with an unknown number of items in and you want to use each nested list for the row values in a table, you can use a null value in the first parameter of #table. The following expression returns a table with four columns like the one above, but with two rows of integer values:

#table(null, {{1,2,3,4},{2,3,4,5}})

image

How To Tell Whether Query Folding Is Taking Place When Importing Data From Analysis Services In Power BI And Excel Power Query

As a quick follow-on from last week’s post on how to detect whether query folding is taking place when importing from OData data sources, if you’re importing data from Analysis Services you have a similar problem: how do you know whether query folding is taking place? Ensuring that query folding takes place for as many of the steps in your query – especially those that filter or otherwise reduce the amount of data returned – is very important for data refresh performance.

Although the Power Query engine generates MDX queries when importing from Analysis Services in the same way it generates SQL queries when it imports from a relational database, the View Native Query option doesn’t work for Analysis Services data sources. You can of course use a Profiler trace or xEvents to see the MDX, but for most users that will not be an option for security reasons. The UI does tell you in another way though. If query folding is taking place for a step, then the Cube Tools menu on the ribbon (with the Add Items and Collapse Columns buttons) will be available, and in the top right-hand corner of the table in the results area there will be a cube icon:

image

If query folding is not taking place for a step (even though it might be taking place for previous steps in the query) then the Cube Tools menu will not be visible, and the cube icon will be replaced by a table icon. For example, in the following screenshot an Index column has been added to the query shown above, so query folding is not taking place from this step on:

image

[Thanks to Jure Jaklic for pointing this out]

%d bloggers like this: