Optimising The Performance Of Power Query Merges In Power BI, Part 2: Does When You Remove Columns Matter?

In my last post I demonstrated how the size of a table affects the performance of Power Query merge operations on non-foldable data sources in Power BI. Specifically, I showed that removing columns from the tables involved in a merge before the merge took place improved performance. But does it matter when you remove the columns? Is it enough to only select the columns you need when you expand the nested table returned by a merge, for example, or just to remove columns after the merge step? So, today’s question is:

Does it make a difference to Power Query merge performance if you remove unwanted columns from your source tables in the step before the merge or in the step afterwards?

The first problem I had to deal with when testing was that, as I said in my previous post, even when I reduced the number of columns in the million-row table I was using, all my queries exceeded the 256MB container size limit and paging took place. This meant that any refresh timings I took in Profiler varied by up to 1-2 seconds from run to run. When you’re comparing queries that take around 50 seconds with queries that take around 10 seconds, a bit of variability in the timings from run to run doesn’t matter much. However, if you’re trying to compare two test scenarios whose performance may not differ much anyway, it becomes more of a problem.

To deal with this I reduced the number of rows in the tables I was using in my merge to 300,000 rows. This kept container memory usage under 256MB and reduced the variability in timings to a minimum. Here are the average timings across multiple runs for the last scenario tested in my previous post – where I removed all but one column in my source tables before the merge and then counted the rows in the table returned by the merge – with these 300,000 row source tables:

  • Progress Report End/25 Execute SQL – 2.4 seconds
  • Progress Report End/17 Read Data – 0 seconds
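For reference, the merge query in that scenario was shaped roughly like the sketch below. In the previous post the columns were actually removed inside the First and Second source queries themselves; here the removal is shown inline as Table.SelectColumns steps, purely so the whole shape is visible in one query:

let
    // stand-ins for the stripped-down source queries from the previous post
    #"First A Only" = Table.SelectColumns(First, {"A"}),
    #"Second A Only" = Table.SelectColumns(Second, {"A"}),
    Source = Table.NestedJoin(#"First A Only", {"A"}, #"Second A Only", {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
    #"Counted Rows"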

I then changed my source queries so that they again returned all seven columns, and changed the merge query so that it removed all but the two A columns in a Removed Other Columns step after the step containing the merge, like so:

let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A", "B", "C", "D", "E", "F", "G"},
        {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
    ),
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded Second", {"A", "Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Removed Other Columns")
in
    #"Counted Rows"

The average timings for this version of the query were identical to those from the previous test:

  • Progress Report End/25 Execute SQL – 2.4 seconds
  • Progress Report End/17 Read Data – 0 seconds

Last of all, I removed all but column A from the source query called First, did not remove any columns from the source query called Second, and then only expanded column A in the #"Expanded Second" step in the merge query:

let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
    #"Counted Rows"

Here are the average timings from this test:

  • Progress Report End/25 Execute SQL – 2.6 seconds
  • Progress Report End/17 Read Data – 0 seconds

These timings were slightly slower, but I don’t think the 200ms increase is really significant.

Therefore, from these tests, I think it’s safe to conclude that:

Removing unwanted columns in a step immediately after a merge results in the same performance benefits as removing unwanted columns in a step immediately before a merge

These results may surprise some of you. Why does removing columns after a merge perform so well? The answer is that just because you remove columns from a table in the step after a merge, it doesn’t mean that the Power Query engine actually removes the columns after it has performed the merge. On the contrary, it’s clever enough to realise that if you remove columns in the step after a merge then there’s no need for those columns to be present at the point where it performs the merge. You probably know how query folding works with certain data sources – well, it turns out that something similar to query folding happens within the Power Query engine whatever data source you’re using. The best explanation of this behaviour – which is fundamental to understanding how the Power Query engine works – can be found in the section on “Streaming Semantics” in this blog post by Ben Gribaudo in his excellent series on M:

https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm
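To make the idea a little more concrete, here is the second test query again with comments describing how evaluation is driven from the bottom step upwards. This is a conceptual sketch of the principle, not a description of the exact plan the engine generates:

let
    // At this point nothing has been read from the sources: this step is just
    // an expression describing a join
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    // Expanding the nested table column is also just an expression
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A", "B", "C", "D", "E", "F", "G"},
        {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
    ),
    // This step only ever asks the step above for the A and Second.A columns,
    // so the other columns never need to be materialised when the merge is
    // actually performed
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded Second", {"A", "Second.A"}),
    // The row count is the only value the query returns, so evaluation is
    // driven from here, pulling just the data that is needed
    #"Counted Rows" = Table.RowCount(#"Removed Other Columns")
in
    #"Counted Rows"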

Of course there may be scenarios where this does not happen, or doesn’t happen efficiently, so I still recommend removing unwanted columns as early as possible in your queries just to be safe.

12 thoughts on "Optimising The Performance Of Power Query Merges In Power BI, Part 2: Does When You Remove Columns Matter?"

    1. Interesting you say that – I have never found that appends perform badly. Have you got some examples? It might be worth me looking at them.

  1. Thanks for clearing this up, Chris. Is there any reason you didn’t define the merge keys before doing the merge operations? Anyway, I think this might be a good third part of this blog post, since you can now measure the merge operation more precisely. If you do take the time to do a third part, it would also be interesting to see if defining the primary key as a foreign key makes any difference, since Microsoft claims the second argument in Table.AddKey is only metadata (for now).
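For anyone who wants to test what the comment above suggests, here is a minimal sketch of what defining keys before the merge might look like, using Table.AddKey to mark column A as a primary key on both source tables (the step names are hypothetical and no timings for this variation are given in this post):

let
    // mark column A as a primary key on both tables before merging
    FirstKeyed = Table.AddKey(First, {"A"}, true),
    SecondKeyed = Table.AddKey(Second, {"A"}, true),
    Source = Table.NestedJoin(FirstKeyed, {"A"}, SecondKeyed, {"A"}, "Second", JoinKind.Inner),
    #"Counted Rows" = Table.RowCount(Source)
in
    #"Counted Rows"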

  2. Would the Join Kind impact the performance? For example:
    When comparing two single-column tables – say LastWeek and ThisWeek, each with one column, say ProjID –
    to find ProjIDs in this week that are not there last week, one way would be to do a Left Outer Join, expand the merged column and keep only the null rows.
    The other way would be to do a Left Anti Join, thereby avoiding the Expand and the null-filter steps.

      1. Thanks Chris –

        It would also be interesting to investigate which is faster:
        = Table.Distinct(Source)
        or
        = Table.Group(Source, Table.ColumnNames(Source), {})

        and whether the number of columns plays a role in these operations.
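Regarding the Join Kind question in the comment above: here is a minimal sketch of the two approaches, assuming two single-column tables named LastWeek and ThisWeek that each contain a ProjID column. The first version uses a Left Outer Join, expands the merged column and keeps only the rows with no match; the second uses a Left Anti Join, which returns the non-matching rows directly:

// Left Outer Join, then expand and keep the rows with no match in LastWeek
let
    Source = Table.NestedJoin(ThisWeek, {"ProjID"}, LastWeek, {"ProjID"}, "LastWeek", JoinKind.LeftOuter),
    #"Expanded LastWeek" = Table.ExpandTableColumn(Source, "LastWeek", {"ProjID"}, {"LastWeek.ProjID"}),
    #"New Projects" = Table.SelectRows(#"Expanded LastWeek", each [LastWeek.ProjID] = null)
in
    #"New Projects"

// Left Anti Join, which returns only the rows of ThisWeek with no match in LastWeek
let
    Source = Table.NestedJoin(ThisWeek, {"ProjID"}, LastWeek, {"ProjID"}, "LastWeek", JoinKind.LeftAnti),
    // the nested LastWeek column is empty for every remaining row, so it can simply be removed
    #"New Projects" = Table.RemoveColumns(Source, {"LastWeek"})
in
    #"New Projects"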

  3. I read this article with great interest when first studying Power Query and took it to heart. As you have said many times, the calculation engines are constantly being improved. I revisited the article series because I had the opportunity to work with multimillion-row production tables. The project required multiple successive joins, mostly on a set of integer GUIDs with a couple of text fields thrown in.
    While I did not perform timings, I can tell you that stripping unnecessary fields from the right-side tables in all joins cut total PBI Desktop and Service processing time in half. I know, I know, "your environment" blah blah blah. Import mode from Dataflow queries, PPU account.
    I did this to one table consisting of only two fields: the integer GUID and its corresponding integer raw value. Inside the NestedJoin I nested SelectColumns to define the right-side table and retrieved both fields. It ran considerably faster than retrieving the two-field table.
    I suspect that at multimillion-row scale column count matters even when the columns number in the single digits, and more so as the right-side table's column count increases. You may want to consider revisiting this question. Thanks.
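A sketch of the pattern described in the comment above, with hypothetical query and column names, where the right-side table is trimmed with Table.SelectColumns inside the call to Table.NestedJoin itself:

let
    Source = Table.NestedJoin(
        LeftTable,
        {"KeyID"},
        // trim the right-side table to just the columns the join needs
        Table.SelectColumns(RightTable, {"KeyID", "RawValue"}),
        {"KeyID"},
        "RightTable",
        JoinKind.Inner
    ),
    #"Expanded RightTable" = Table.ExpandTableColumn(Source, "RightTable", {"RawValue"}, {"RawValue"})
in
    #"Expanded RightTable"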
