Optimising The Performance Of Power Query Merges In Power BI, Part 4: Table.Join And Other Join Algorithms

In the previous post in this series I showed how you can use the Table.Join function and the SortMerge algorithm to get much better performance for merge operations in Power Query – but only if your data is sorted. But what if your data isn’t sorted? Can those other join algorithms give you better performance? The answer is… you guessed it, it depends.

As a reminder, the seven join algorithms that can be used with Table.Join are:

  • JoinAlgorithm.Dynamic
  • JoinAlgorithm.LeftHash
  • JoinAlgorithm.LeftIndex
  • JoinAlgorithm.PairwiseHash
  • JoinAlgorithm.RightHash
  • JoinAlgorithm.RightIndex
  • JoinAlgorithm.SortMerge

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.
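
For example, all three of the merges in the sketch below leave the choice of algorithm up to the engine. TableA, TableB and their key columns KeyA and KeyB are just hypothetical names used for illustration:

let
    // 1. Omit the optional sixth parameter of Table.Join
    DefaultJoin = Table.Join(TableA, {"KeyA"}, TableB, {"KeyB"}, JoinKind.Inner),
    // 2. Pass JoinAlgorithm.Dynamic explicitly
    DynamicJoin = Table.Join(TableA, {"KeyA"}, TableB, {"KeyB"}, JoinKind.Inner, JoinAlgorithm.Dynamic),
    // 3. Use Table.NestedJoin, which has no algorithm parameter at all
    NestedJoin = Table.NestedJoin(TableA, {"KeyA"}, TableB, {"KeyB"}, "SecondTable", JoinKind.Inner)
in
    DynamicJoin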

There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic, but you’ll have to do some thorough testing to prove it. From what I’ve seen there are plenty of cases where explicitly setting the algorithm results in worse performance, although there are enough cases where it results in better performance to make all that testing worthwhile.

For example, using the same CSV file that I’ve been using in my previous posts, I created two source queries called First and Second that only returned column A and the first 300,000 rows. Here’s the M code for First (the code for Second only differs in that it renames the only column to A2):

let
    Source = Csv.Document(
        File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers", {"A"}),
    #"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns", {{"A", "A1"}}),
    #"Kept First Rows" = Table.FirstN(#"Renamed Columns", 300000)
in
    #"Kept First Rows"

Here’s a query that uses Table.Join and JoinAlgorithm.Dynamic to merge these two queries:

let
    Source = Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.Dynamic)
in
    Source

The average timings for this query on my PC were:

  • Progress Report End/25 Execute SQL – 2.0 seconds
  • Progress Report End/17 Read Data – 0.4 seconds

Changing this query to use JoinAlgorithm.LeftHash instead, like so:

let
    Source = Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.LeftHash)
in
    Source

…resulted in the following average timings:

  • Progress Report End/25 Execute SQL – 0.9 seconds
  • Progress Report End/17 Read Data – 0.6 seconds

An improvement of almost one second – but I haven’t included here the test results for the other algorithms that performed worse (I had to cancel the query that used JoinAlgorithm.LeftIndex because it was so slow). And just to be clear: I’m not saying that using JoinAlgorithm.LeftHash is always better than JoinAlgorithm.Dynamic, just that it happened to perform better in this case with these queries and this data. With different data and different queries, different algorithms may perform better. If you find other scenarios where specifying an algorithm improves performance then please let me know by leaving a comment.
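
One thing that makes this kind of testing a little less tedious is wrapping the merge in a function that takes the join algorithm as a parameter, so only one value needs to change between test runs. Here’s a rough sketch – MergeWithAlgorithm is just a name I’ve made up, and First and Second are the source queries above:

let
    // Hypothetical helper: performs the same inner merge but lets you pass in the algorithm to test
    MergeWithAlgorithm = (algorithm) =>
        Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, algorithm),
    // Swap in JoinAlgorithm.RightHash, JoinAlgorithm.PairwiseHash etc. between test runs
    Source = MergeWithAlgorithm(JoinAlgorithm.LeftHash)
in
    Source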

6 thoughts on “Optimising The Performance Of Power Query Merges In Power BI, Part 4: Table.Join And Other Join Algorithms”

  1. Parv Chana says:

    I will surely test different join algorithms next time I come across a similar scenario and see what works best. I was waiting for this post. Thanks for sharing your findings and guidance.

  2. Chris,

    I was struggling with joined table performance and Imke from The BIccountant (https://www.thebiccountant.com) suggested trying to add keys to the lookup tables I use in Table.NestedJoin to improve performance. Have you considered testing this theory? (I know Imke is one of your readers; she has a few links to your blog posts…)

    Anyway, I thought I would mention it. Thanks for the series!
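
    For reference, a minimal sketch of the approach being described here – using Table.AddKey to mark the join column on a hypothetical DimProduct lookup query as a primary key before the merge – might look like this:

    let
        // Mark the join column on the (hypothetical) lookup table as a primary key
        KeyedDim = Table.AddKey(DimProduct, {"ProductKey"}, true),
        // Then merge the (hypothetical) Sales query against the keyed table as usual
        Merged = Table.NestedJoin(Sales, {"ProductKey"}, KeyedDim, {"ProductKey"}, "DimProduct", JoinKind.LeftOuter)
    in
        Merged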

  3. To summarize (especially for myself):
    If you’re doing a ‘specific’ join, it sounds like the ‘side’ should be the table with fewer records. In your case, with equal record counts, it doesn’t matter. But if First had fewer unique keys, then Left would be better.
    As for Hash versus Index versus PairwiseHash versus SortMerge…
    Only use SortMerge if both join fields are in ascending order.
    Only use Index if you actually have an index created for one (or both?) sides.
    Use PairwiseHash… if each value only shows up once on each side?
    Otherwise use Hash.

  4. If I am using merges as a LOOKUP type behavior, where I am just looking to get an ID from a dimension table perhaps, is there a way to create a temporary table in M before the merge that contains only the two columns I need from the lookup table? The idea would be to create the temporary table in memory with just the two columns (the search column and the value I am looking for), get the values and then move on, rather than loading the entire dimension table into memory. Or does it do this already?
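
    A minimal sketch of the kind of thing being asked about – applying Table.SelectColumns to a hypothetical DimProduct query before the merge so that only two columns take part in it – might look like this:

    let
        // Keep just the lookup column and the ID we want from the (hypothetical) dimension query
        SlimDim = Table.SelectColumns(DimProduct, {"ProductName", "ProductID"}),
        // Merge the (hypothetical) Sales query against the two-column table instead of the full dimension
        Merged = Table.NestedJoin(Sales, {"ProductName"}, SlimDim, {"ProductName"}, "Lookup", JoinKind.LeftOuter)
    in
        Merged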

  5. I know this is old but you don’t mention that with this method PQ requires the field names to be unique. It could be an update since this article, but no matter which option I use I get a “….. results in a duplicate column” error, or something to that effect, i.e. I do the table join and specify the key field, and whatever the next column or first column in the table is throws that error. Well duh, that is why I am joining the tables….
