Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are but #shared shows them:

Algs

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:


let
Source = Csv.Document(
File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
[Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
#"Changed Type" = Table.TransformColumnTypes(
#"Promoted Headers",
{
{"A", Int64.Type},
{"B", Int64.Type},
{"C", Int64.Type},
{"D", Int64.Type},
{"E", Int64.Type},
{"F", Int64.Type},
{"G", Int64.Type}
}
),
#"Renamed Columns" = Table.RenameColumns(
#"Changed Type",
{{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
)
in
#"Renamed Columns"

First

Here’s a query that merges these queries using Table.NestedJoin and returns all columns from the source queries and is enabled:


let
Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
#"Expanded Second" = Table.ExpandTableColumn(
Source,
"Second",
{"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
{"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
)
in
#"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:


let
Source =
Table.Join(
First, {"A1"}, Second, {"A2"},
JoinKind.Inner, JoinAlgorithm.SortMerge
)
in
Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.

 

 

11 responses

  1. The M documentation does indeed give you the parameter values—not that I knew how to use them! I’ve always wondered if the Table.Join was faster—bout to play with it right now!

  2. Interesting. I think I have seen this Table.Join before, created by the GUI when expanding related tables from a SQL database (having query folding). At that time I remember the folded SQL code shown by the GUI was the same as if I created a Table.NestedJoin, although the execution time using the Table.Join was faster. But I need to recheck this

    But now to my question. The Table.Join will expand the whole second table. If I don’t want the whole second table to be expanded (just the A column), will Table.Join also be better in this scenario. I.E. only expanding A2 using Table.NestedJoin and only keeping A2 after the Table.Join operation. This is almost the typical case when you want to merge a surrogate key to a fact table

    Also a small comment. The Table.Join function will not work if the first and second table contains columns with identical names, while this is not a problem using Table.NestedJoin

    • Sorry for confusing with my first section. I mixed that up with the usage of Table.ExpandTableColumn and Table.ExpandRecordColumn, not related to this at all :o)

      My curiosity on the mid-section question still remains though

      • Found the answer myself. To only expand a selected column from the second table using a Table.Join, the easiest is to pre-prepare the second table before the merge operation, i.e. a sub-select on the second table as below

        let
        Source =
        Table.Join(
        First, {“A1”}, Table.SelectColumns(Second, {“A2”}), {“A2”},
        JoinKind.Inner, JoinAlgorithm.SortMerge
        )
        in
        Source

        And a rename will be handled the very same way by wrapping a Table.RenameColumns on-top of the Table-SelectColumns. And… If sorting is not pre-prepared, also this can happen by wrapping a Table.Sort on-top of the other table functions

        I guess this will also answer part of SAMs post below concerning time for renaming and selecting columns, since this will be 0 seconds.
        Although the sorting will take time. But in cases where the second table is a small dimension and the first table is a larger fact table, this Table.Join will still be super useful.

        Thanks for a superb blog about this Chris!!!

  3. Pingback: Optimizing Power BI Merge Performance with Table.Join – Curated SQL

  4. Just wanted to drop a note to say that this series of posts has been really useful and very timely. With the information you have provided I have massively improved the performance of my merge queries (I am merging 15 tables – don’t ask!)

  5. Hi Chris
    To make a fair comparison on the timing we should add the timing of the Sort Step, The Rename column step (to ensure the column names are different in both tables) and the Remove column step to remove the unnecessary columns after the Join

    Like the 1 and -1 options of MATCH in excel this must be using a super fast Binary search algorithm that needs data sorted (in ASC / DESC) rather than the linear search of the 0 option (which works on unsorted data)

    • The point is, in this case, that the data is already sorted so no sort step is necessary. Sometimes that is the case. However, as I say in the post, if you need to sort in Power Query then that will probably cancel out any performance improvements from using the SortMerge algorithm. There is no overhead to renaming columns and as I showed in the previous posts in this query, you should remove any unwanted columns before the merge anyway.

      • “If your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.”

        I haven’t tested (yet), but fwiw for some other programming I’ve done (specifically adding a large array of items to a Scripting.Dictionary in Excel VBA) the time required to import the data into Excel and presort was a small fraction of the efficiency gain I got from adding sorted vs unsorted data. Sorting took seconds, using sorted vs unsorted cut processing time from 25 to 14 seconds. So sorting gave a VERY good ROI.

        Will be interesting to test if this holds true for PQ. I’ll try this on my current data (merging 800k rows with 800k rows) time willing.

  6. Pingback: Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge | Pardaan.com

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: