
Optimising The Performance Of Power Query Merges In Power BI, Part 2: Does When You Remove Columns Matter?

In my last post I demonstrated how the size of a table affects the performance of Power Query merge operations on non-foldable data sources in Power BI. Specifically, I showed that removing columns from the tables involved in a merge before the merge took place improved performance. But does it matter when you remove the columns? Is it enough to only select the columns you need when you expand the nested table returned by a merge, for example, or just to remove columns after the merge step? So, today’s question is:

Does it make a difference to Power Query merge performance if you remove unwanted columns from your source tables in the step before the merge or in the step afterwards?

The first problem I had to deal with when testing was that, as I said in my previous post, even when I reduced the number of columns in the million-row table I was using, all my queries were exceeding the 256MB container size limit and paging was taking place. This meant that any refresh timings I took in Profiler varied by up to 1-2 seconds from run to run. When you’re comparing queries that take around 50 seconds with queries that take around 10 seconds, a bit of variability in the timings from run to run doesn’t matter much. However, if you’re trying to compare two test scenarios whose performance may not differ much anyway, it becomes more of a problem.

To deal with this I reduced the tables I was using in my merge to 300,000 rows each. This kept container memory usage under 256MB and reduced the variability in timings to a minimum. Here are the average timings across multiple runs for the last scenario tested in my previous post – where I removed all but one column in my source tables before the merge and then counted the rows in the table returned by the merge (a sketch of that scenario is shown after the timings below) – with these 300,000-row source tables:

  • Progress Report End/25 Execute SQL – 2.4 seconds
  • Progress Report End/17 Read Data – 0 seconds
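
To recap, in that scenario the unwanted columns were removed in the source queries themselves, before the merge query ever referenced them. Very roughly – the CSV source and file path shown here are just placeholders, not the exact source from the previous post – the source query called First looked something like this, with Second identical apart from the file it read; the merge query then just joined the two tables on column A, expanded column A from the nested table and counted the rows:

let
    Source = Csv.Document(File.Contents("C:\SomeFolder\SourceData.csv")),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Keep only column A, so the merge downstream never sees the other six columns
    #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers", {"A"})
in
    #"Removed Other Columns"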

I then changed my source queries so that they again returned all seven columns, and changed the merge query so that it removed all but the A and Second.A columns in a Removed Other Columns step after the step containing the merge, like so:

let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A", "B", "C", "D", "E", "F", "G"},
        {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
    ),
    #"Removed Other Columns" = Table.SelectColumns(#"Expanded Second", {"A", "Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Removed Other Columns")
in
    #"Counted Rows"

The average timings for this version of the query were identical to those from the previous test:

  • Progress Report End/25 Execute SQL – 2.4 seconds
  • Progress Report End/17 Read Data – 0 seconds

Last of all, I removed all but column A from the source query called First, did not remove any columns from the source query called Second, and then only expanded column A in the #"Expanded Second" step in the merge query:

let
    Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
    #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
    #"Counted Rows"

Here are the average timings from this test:

  • Progress Report End/25 Execute SQL – 2.6 seconds
  • Progress Report End/17 Read Data – 0 seconds

These timings were slightly slower, but I don’t think the 200ms increase is really significant.

Therefore, from these tests, I think it’s safe to conclude that:

Removing unwanted columns in a step immediately after a merge results in the same performance benefits as removing unwanted columns in a step immediately before a merge

These results may surprise some of you. Why does removing columns after a merge perform so well? The answer is that just because you remove columns from a table in the step after a merge, it doesn’t mean that the Power Query engine actually removes the columns after it has performed the merge. On the contrary, it’s clever enough to realise that if you remove columns in the step after a merge then there’s no need for those columns to be present at the point where it performs the merge. You probably know how query folding works with certain data sources – well, it turns out that something similar to query folding happens within the Power Query engine whatever data source you’re using. The best explanation of this behaviour – which is fundamental to understanding how the Power Query engine works – can be found in the section on “Streaming Semantics” in this blog post by Ben Gribaudo in his excellent series on M:

https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm
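
If you want to see this laziness in action for yourself, here’s a small self-contained illustration – the #table source and the deliberately slow column are made up purely for this example and have nothing to do with the tests above. If the engine had to compute the Expensive column before removing it, this query would take many minutes; because the column is removed in the following step its values should never be evaluated, and the row count comes back almost immediately:

let
    // A made-up table with 100,000 rows and two numeric columns
    Source = #table(
        type table [A = Int64.Type, B = Int64.Type],
        List.Transform({1..100000}, each {_, _ * 2})
    ),
    // Add a column that is artificially slow to compute (roughly 10ms per value)
    #"Added Expensive" = Table.AddColumn(
        Source,
        "Expensive",
        each Function.InvokeAfter(() => [A] + [B], #duration(0, 0, 0, 0.01))
    ),
    // Remove it again in the very next step...
    #"Removed Expensive" = Table.SelectColumns(#"Added Expensive", {"A"}),
    // ...so counting the rows never needs the Expensive values at all
    #"Counted Rows" = Table.RowCount(#"Removed Expensive")
in
    #"Counted Rows"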

Of course there may be scenarios where this does not happen, or doesn’t happen efficiently, so I still recommend removing unwanted columns as early as possible in your queries just to be safe.

11 thoughts on “Optimising The Performance Of Power Query Merges In Power BI, Part 2: Does When You Remove Columns Matter?”

    1. Chris Webb says:

      Interesting you say that – I have never found that appends perform badly. Have you got some examples? It might be worth me looking at.

      1. Thanks, I can’t reproduce it now. Do you know if they upgraded the PQ engine in the last year?

      2. Chris Webb says:

        They upgrade it all the time, so I guess whatever problem you were running into has now been solved.

  1. Thanks for clearing this up, Chris. Is there any reason you didn’t define the merge keys before doing the merge operations? Anyway, I think this might be a good third part of this blog post, since you can now measure the merge operation more precisely. If you take the time to do this third part, it would also be interesting to see if defining the primary key as a foreign key makes any difference, since Microsoft claims the second argument in Table.AddKey is only metadata (for now).

    1. Chris Webb says:

      Keys were already on my list of things to investigate 🙂
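
(For anyone wondering what defining the merge keys up front would look like, here’s a minimal sketch using the First and Second queries from this post – Table.AddKey’s third argument marks the key as primary, and whether this actually helps merge performance is exactly the open question:)

let
    // Hypothetical: declare column A as a primary key on both tables before merging
    FirstKeyed = Table.AddKey(First, {"A"}, true),
    SecondKeyed = Table.AddKey(Second, {"A"}, true),
    Source = Table.NestedJoin(FirstKeyed, {"A"}, SecondKeyed, {"A"}, "Second", JoinKind.Inner),
    #"Counted Rows" = Table.RowCount(Source)
in
    #"Counted Rows"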

  2. Would the join kind impact performance? For example, when comparing two single-column tables – say LastWeek and ThisWeek, each with one column, ProjID – to find the ProjIDs in ThisWeek that weren’t there last week, one way would be to do a Left Outer Join, expand the merged column and then filter on the nulls to keep the unmatched rows. The other way would be to do a Left Anti Join, thereby avoiding the Expand and Filter steps (see the sketch below).
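
(Roughly, the two approaches described in the comment above look like this in M – LastWeek, ThisWeek and ProjID are just the names used in the comment:)

// Approach 1: Left Outer Join, expand the merged column, then keep only the unmatched rows
let
    Joined = Table.NestedJoin(ThisWeek, {"ProjID"}, LastWeek, {"ProjID"}, "LastWeek", JoinKind.LeftOuter),
    Expanded = Table.ExpandTableColumn(Joined, "LastWeek", {"ProjID"}, {"LastWeek.ProjID"}),
    // A null here means the ProjID was not found in LastWeek
    NewProjIDs = Table.SelectRows(Expanded, each [LastWeek.ProjID] = null)
in
    NewProjIDs

// Approach 2: Left Anti Join – the unmatched rows are returned directly, so no expand or filter step is needed
let
    AntiJoined = Table.NestedJoin(ThisWeek, {"ProjID"}, LastWeek, {"ProjID"}, "LastWeek", JoinKind.LeftAnti),
    NewProjIDs = Table.RemoveColumns(AntiJoined, {"LastWeek"})
in
    NewProjIDs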

    1. Chris Webb says:

      That’s also on my list of things to investigate, yes.

      1. Thanks Chris –

        It would also be interesting to investigate which is faster:
        = Table.Distinct(Source)
        or
        = Table.Group(Source, Table.ColumnNames(Source), {})

        and whether the number of columns plays a role in these operations.
