Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are but #shared shows them:

Algs

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:

let
Source = Csv.Document(
File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
[Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
#"Changed Type" = Table.TransformColumnTypes(
#"Promoted Headers",
{
{"A", Int64.Type},
{"B", Int64.Type},
{"C", Int64.Type},
{"D", Int64.Type},
{"E", Int64.Type},
{"F", Int64.Type},
{"G", Int64.Type}
}
),
#"Renamed Columns" = Table.RenameColumns(
#"Changed Type",
{{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
)
in
#"Renamed Columns"
First

Here’s a query that merges these queries using Table.NestedJoin and returns all columns from the source queries and is enabled:

let
Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
#"Expanded Second" = Table.ExpandTableColumn(
Source,
"Second",
{"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
{"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
)
in
#"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:

let
Source =
Table.Join(
First, {"A1"}, Second, {"A2"},
JoinKind.Inner, JoinAlgorithm.SortMerge
)
in
Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.

31 thoughts on “Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

  1. The M documentation does indeed give you the parameter values—not that I knew how to use them! I’ve always wondered if the Table.Join was faster—bout to play with it right now!

  2. Interesting. I think I have seen this Table.Join before, created by the GUI when expanding related tables from a SQL database (having query folding). At that time I remember the folded SQL code shown by the GUI was the same as if I created a Table.NestedJoin, although the execution time using the Table.Join was faster. But I need to recheck this

    But now to my question. The Table.Join will expand the whole second table. If I don’t want the whole second table to be expanded (just the A column), will Table.Join also be better in this scenario. I.E. only expanding A2 using Table.NestedJoin and only keeping A2 after the Table.Join operation. This is almost the typical case when you want to merge a surrogate key to a fact table

    Also a small comment. The Table.Join function will not work if the first and second table contains columns with identical names, while this is not a problem using Table.NestedJoin

    1. Sorry for confusing with my first section. I mixed that up with the usage of Table.ExpandTableColumn and Table.ExpandRecordColumn, not related to this at all :o)

      My curiosity on the mid-section question still remains though

      1. Found the answer myself. To only expand a selected column from the second table using a Table.Join, the easiest is to pre-prepare the second table before the merge operation, i.e. a sub-select on the second table as below

        let
        Source =
        Table.Join(
        First, {“A1”}, Table.SelectColumns(Second, {“A2”}), {“A2”},
        JoinKind.Inner, JoinAlgorithm.SortMerge
        )
        in
        Source

        And a rename will be handled the very same way by wrapping a Table.RenameColumns on-top of the Table-SelectColumns. And… If sorting is not pre-prepared, also this can happen by wrapping a Table.Sort on-top of the other table functions

        I guess this will also answer part of SAMs post below concerning time for renaming and selecting columns, since this will be 0 seconds.
        Although the sorting will take time. But in cases where the second table is a small dimension and the first table is a larger fact table, this Table.Join will still be super useful.

        Thanks for a superb blog about this Chris!!!

      2. Thanks for your follow up on this Par. I Have been researching this to find a way to use Table.Join and not expand the entire table. The issue with Table.NestedJoin is that is doesn’t fold and there is zero to none documentation on the microsoft site. Thanks!

      3. Hey Folks! I saw some interesting thing in your post and 13 seconds after make my self little confused. Its possible to use Table.SelectColumns( to retrieve only the column that I need on the second?

        For example

        Table.Join( fact_table{“Year_Month”}
        ,Backlog,”Year_Month”}
        ,JoinKind.LeftOuter )

        How could I retrieve only the {“backlog_status”} from Backlog tbl?

        Ps some”Year_Month” rows are null

  3. Just wanted to drop a note to say that this series of posts has been really useful and very timely. With the information you have provided I have massively improved the performance of my merge queries (I am merging 15 tables – don’t ask!)

    1. Folks, you do not have to rename the join column if you use an inner join. If you are inner joining on ID No, it will give you the joined tables with just one join column. Also, you can join only the columns you want to use, just by using the right syntax. So instead of your second table being Table2, {ID No}, you can write

      = Table.Join(Table1, {ID No}, Table2[[ID No], [Column2]], {ID No}, JoinKind.Inner, JoinAlgorithm.SortMerge)

      Now you don’t have to rename anything or remove any columns.

      —Nate

  4. Hi Chris
    To make a fair comparison on the timing we should add the timing of the Sort Step, The Rename column step (to ensure the column names are different in both tables) and the Remove column step to remove the unnecessary columns after the Join

    Like the 1 and -1 options of MATCH in excel this must be using a super fast Binary search algorithm that needs data sorted (in ASC / DESC) rather than the linear search of the 0 option (which works on unsorted data)

    1. The point is, in this case, that the data is already sorted so no sort step is necessary. Sometimes that is the case. However, as I say in the post, if you need to sort in Power Query then that will probably cancel out any performance improvements from using the SortMerge algorithm. There is no overhead to renaming columns and as I showed in the previous posts in this query, you should remove any unwanted columns before the merge anyway.

      1. “If your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.”

        I haven’t tested (yet), but fwiw for some other programming I’ve done (specifically adding a large array of items to a Scripting.Dictionary in Excel VBA) the time required to import the data into Excel and presort was a small fraction of the efficiency gain I got from adding sorted vs unsorted data. Sorting took seconds, using sorted vs unsorted cut processing time from 25 to 14 seconds. So sorting gave a VERY good ROI.

        Will be interesting to test if this holds true for PQ. I’ll try this on my current data (merging 800k rows with 800k rows) time willing.

  5. As per twitter thread at https://twitter.com/InsightsMachine/status/1281771716448890880 I’ve found that when using this on CSV files I’ve had to buffer the *result* of it before joining to a third table. (That third table was a derivitaive of one of the other two, where I had done a Group By to bring totals in).

    When I didn’t use Table.Buffer, I got an empty table when expanding columns…even though both tables I was merging a) both had data and b) both had identical key columns with identical type.

    Buffering the output of the preceding Table.Join sorted the issue. So it seems there was some kind of streaming going on that resulted in that empty table, that was fixed by buffering the entire table in memory.

  6. Hi Chris,
    I try to use JoinAlgorithm.SortMerge inside a dataflow. In the On-line Query Editor the result of my query is good. But when i connect to my Dataflow from Power BI Desktop i loose a lot of rows of data……

  7. Hey Chris,
    amazing blog as always. Im wondering if you’ll cover the other types of joinalgorithms. there is much needed demand of that on the www. and microsoft has provided 0 inputs sadly.

  8. Can this technique be used when merging on two columns given that both columns are sorted in both tables (second column sorted within first) ?

  9. I would like to better understand why Power BI merges become so large. I am using multiple merges, with each single query no larger than 5 M rows…but when I download a merged version, it downloads 17 M rows. Can Table.Join help somehow?

  10. What if I want to use this numerous times in a single query? I can’t seem to add the prior step name to the function without error. For example, “=TableJoin(#”Prior Step Name”, Table1, “Name”, Table2, “Name”, JoinKind.Inner, JoinAlgorithm.SortMerge)”. I have a large flat file with mostly text columns (there’s no option to change this). Using a staging query, I referenced each text column into its own query, removed dups and added an index column. Then I have to merge the IDs back into the secondary query to add the IDs and remove the text columns. If I use the function without the step name, it will just keep the last one I do. If I add the step name, it errors – “We cannot convert a value of type List to type table.” Maybe I need to create a custom function for it?

    1. No, I think you’re making a mistake in your code somewhere – this should be straightforward and there’s no need for a custom function. Can you post your full M query?

      1. I get the same error.

        The original version works:
        = Table.NestedJoin(Merge_Actual_Fcst, {“Style Date Key”}, Merge_LastFcst_LostSales, {“Style Date Key”}, “LastFcst_Stacked”, JoinKind.LeftOuter)

        Merge_Actual_Fcst has 133k rows and Merge_LastFcst_LostSales has 100k rows and the Nested Join takes 15 minutes.

        The Table.Join version throws an error:
        = Table.Join(Merge_Actual_Fcst, {“Style Date Key”}, Merge_LastFcst_LostSales, {“Style Date Key”}, “LastFcst_Stacked”, JoinKind.LeftOuter, JoinAlgorithm.SortMerge)

        Here is the error:
        Expression.Error: We cannot convert the value “LastFcst_Stacked” to type Number.
        Details:
        Value=LastFcst_Stacked
        Type=[Type]

        The LastFcst_Stacked column is Whole Number. The Style Date Key columns in both tables are Text.

        A comment I saw elsewhere said that a table reference followed by a column name will return a List type, which might cause the error.

  11. I am facing the same problem, except I have created my tables using SQL queries from an oracle database. I tried to interpret the code to make it work with table/query names instead of the spreadsheet columns etc. you show in your example but I am having no luck. I would like to merge table DATES, field WRK_REQ_ID with table JEOPS, field WRK_REQ_ID. Please help?

  12. This is great thank you
    Have you noticed the following
    I’m merging multiple dimension tables to a fact table ( dont ask why) using Merge “Table Join” as I read in “https://towardsdatascience.com/query-folding-in-power-bi-devil-is-in-the-detail-d564ab0cb32” that this would enable query folding to occur.

    Here is the issue / challenge –
    the results of the first merge using “Table.Join” are no longer in the fact table once the second merge using “Table.Join” and so the results of the second merge using “Table.Join” are no longer in the fact table once the 3rd merge using “Table.Join” is in place

    any ideas thanks in advance Adam

  13. Hi

    Learned a lot from you blog. Thanks!

    While the answer seems obvious, I can’t seem to find it documented anywhere: Is the result of a Table.Join(….., JoinAlgorithm.SortMerge) sorted?

    Ed

  14. Inevitably, when I’m stuck and search on a Power Query problem, I find one of your posts that tackle the problem head on. My tables didn’t seem that big (14k and 6k rows with 30 columns) but the merge was taking quite a long time. This approach has brought it down to under 20 seconds.

    Thank you so much for your invaluable site. I have learned so much from your posts.

    Les Z

Leave a Reply to Chris WebbCancel reply