Optimising The Performance Of Power Query Merges In Power BI, Part 1: Removing Columns

Merging (or, in SQL terms, joining) tables in Power Query is a common cause of refresh performance problems. I’ve often wondered whether there’s anything you can do to optimise the performance of merges for non-foldable data source and so I decided to run some tests to try to answer all the questions I had. In this series of posts I’ll tell you what I found.

For these tests the only data source I used was a CSV file with one million rows and seven numeric columns named A, B C, D, E, F and G:

Csv

I used SQL Server Profiler to measure the amount of time taken for a query to execute using the technique I blogged about here. If you read that post (and I strongly recommend you do) you’ll see there are actually two Profiler events whose duration is significant when measuring refresh performance:

  • Progress Report End/25 Execute SQL
  • Progress Report End/17 Read Data

It also turns out that these two events provide some insight into how certain transformations are handled in the Power Query engine, but we’ll come to that later.

The first question I decided to investigate was this:

Does the number of columns in a table affect the performance of a merge?

First of all, I created two identical queries called First and Second that connected to the CSV file, used the first row from the file as the headers, and set the data types to all seven columns to Whole Number. Nothing very special, but here’s the M code all the same:

let
  Source = Csv.Document(
    File.Contents("C:\NumbersMoreColumns.csv"), 
    [Delimiter = ",", Columns = 7, 
     Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = 
   Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers", 
    {
      {"A", Int64.Type}, 
      {"B", Int64.Type}, 
      {"C", Int64.Type}, 
      {"D", Int64.Type}, 
      {"E", Int64.Type}, 
      {"F", Int64.Type}, 
      {"G", Int64.Type}
    }
  )
in
  #"Changed Type"

I disabled these queries so that they were not loaded into the dataset.

Next, I created a third query that used the Table.NestedJoin function to merge the data from these two queries using an inner join and return all of the columns from both source queries:

let
  Source = Table.NestedJoin(
   First, {"A"}, Second, {"A"}, 
   "Second", JoinKind.Inner),
  #"Expanded Second" = Table.ExpandTableColumn(
    Source, 
    "Second", 
    {"A", "B", "C", "D", "E", "F", "G"}, 
    {"Second.A", "Second.B", "Second.C", 
     "Second.D", "Second.E", "Second.F", "Second.G"}
  )
in
  #"Expanded Second"

MergeAllCols

When I refreshed this query, in Profiler the two events I mentioned above had the following durations:

  • Progress Report End/25 Execute SQL – 40 seconds
  • Progress Report End/17 Read Data – 56 seconds

Pretty slow. But what is performance like when you merge two tables with one column instead of seven?

To test this, I added an extra step to the First and Second queries that removed all but the A columns (the ones needed for the merge) like so:

let
  Source = Csv.Document(
    File.Contents("C:\NumbersMoreColumns.csv"), 
    [Delimiter = ",", Columns = 7, 
     Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = 
   Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers", 
    {
      {"A", Int64.Type}, 
      {"B", Int64.Type}, 
      {"C", Int64.Type}, 
      {"D", Int64.Type}, 
      {"E", Int64.Type}, 
      {"F", Int64.Type}, 
      {"G", Int64.Type}
    }
  ),
  #"Removed Other Columns" = 
   Table.SelectColumns(#"Changed Type", {"A"})
in
  #"Removed Other Columns"

Just A

I then updated the third query that contained the merge to reflect this change:

let
  Source = Table.NestedJoin(
   First, {"A"}, Second, {"A"}, 
   "Second", JoinKind.Inner),
  #"Expanded Second" = 
   Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"})
in
  #"Expanded Second"

Merge

When this query was refreshed, Profiler showed the following durations:

  • Progress Report End/25 Execute SQL – 9 seconds
  • Progress Report End/17 Read Data – 1 seconds

This query is a lot quicker, but then I thought: what if the performance is more to do with the size of the table returned by the query rather than the merge? So I added an extra step to the end of the merge query, like so:

let
  Source = Table.NestedJoin(
   First, {"A"}, Second, {"A"}, 
   "Second", JoinKind.Inner),
  #"Expanded Second" = 
   Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
  #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
  #"Counted Rows"

Rowcount

…and then reran the two tests above. My thinking was that now the merge query only returns a single value the amount of data returned by the query should not be a factor in the duration of the queries.

Here are the timings for the version with the merge on the tables with seven columns:

  • Progress Report End/25 Execute SQL – 56 seconds
  • Progress Report End/17 Read Data – 0 seconds

Here are the timings for the version with the merge on the tables with just one column:

  • Progress Report End/25 Execute SQL – 14 seconds
  • Progress Report End/17 Read Data – 0 seconds

This does seem to confirm that the number of columns in a table affects the performance of a merge, although of course it might be that it takes longer to count the rows of a table that has more columns.

This shows something else too: Read Data is instant in both cases, compared to the first two tests where it took longer than the Execute SQL events.

Why does the number of columns influence the performance of a merge? If you read my recent post on monitoring memory usage with Query Diagnostics, you’ll remember that merges have to take place in memory – so I guess the larger the tables involved in the merge, the more memory is needed and the more paging happens if the 256MB limit is exceeded. Looking at the performance counter data generated for the last two queries showed that the 256MB limit was indeed exceeded for both the last two queries above, but while the version joining the table with two columns had a maximim commit of 584MB the maximum commit for the version joining the table with seven columns was almost 3GB.

That’s enough for now. At this point I think it’s fair to say the following:

Removing any unwanted columns before you merge two tables in Power Query will improve refresh performance.

11 responses

  1. Hi Chris, Thanks a lot for your blog posts which are always very insightful.
    On such a painful issue as the performance of Merge i think it would be great to have an Official explanation by the development Team itself also) and i am writing this comment for them.

  2. Pingback: Optimizing Power Query Merges – Curated SQL

  3. Pingback: Chris Webb's BI Blog: Optimising The Performance Of Power Query Merges In Power BI, Part 2 Chris Webb's BI Blog

  4. Thanks for this article Chris. Question about it. Regarding removing unwanted columns. If you have a lot of merges (let’s say greater than three), is it better to remove the columns that were needed to merge (and not needed) at the end of all the merges or have multiple remove columns steps that would occur right have the merge was complete?

    • Good question. My instinct says you should remove all the columns you don’t need as soon as you don’t need them. So if, after one of these merges, you no longer need a column you should remove it. But it would be interesting to test this.

    • Another way of phrase this question:
      If I’m removing columns from a table, would ‘removing’ one column at a time be notably slower than removing multiple columns in a single step?

      Example:
      Let
      One_at_a_Time = Table.SelectColumns(First, {“A”, “B”, “C”, “D”, “E”}),
      One_at_a_Time_2 = Table.SelectColumns(First, {“A”, “B”, “C”, “D”}),
      One_at_a_Time_3 = Table.SelectColumns(First, {“A”, “B”, “C”}),
      One_at_a_Time_4 = Table.SelectColumns(First, {“A”, “B”}),
      One_at_a_Time_5 = Table.SelectColumns(First, {“A”})
      in
      One_at_a_Time_5

      Verse:
      Let
      All_at_once = Table.SelectColumns(First, {“A”})
      in
      All_at_once

      My guess is; No.
      Assuming my guess is correct – that removing as ‘early’ as possible will probably speed it up — and might make your code Human harder to read.
      As the above example shows, All_at_once is preferable, since it’s easier to read.

      • I’m pretty sure that removing columns in separate steps like this will perform the same as removing them in a single step too.

  5. Chris, how do you know the slowness isn’t caused by Table.ExpandTableColumn having more work to do with 1 column verses 7 columns?

  6. Pingback: Chris Webb's BI Blog: Optimising The Performance Of Power Query Merges In Power BI, Part 5: Cross Joins Chris Webb's BI Blog

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: