Improving The Performance Of Aggregation After A Merge In Power BI And Excel Power Query/Get&Transform

A long time ago someone – probably from the Power Query dev team – told me that adding primary keys to M tables could improve the performance of certain transformations in Power BI and Excel Power Query/Get&Transform. Indeed, I mentioned this in chapter 5 of my book “Power Query for Power BI and Excel”, but it wasn’t until this week that I found a scenario where this was actually the case. In this post I’ll describe the scenario and try to draw some conclusions about when adding primary keys to tables might make a difference to performance.

Imagine you have two queries that return data from two different files. First of all there’s a query called #”Price Paid”, which reads data from a 140MB csv file containing some of my favourite demo data: all 836,933 property transactions in the year 2017 from the UK Land Registry Price Paid dataset. The query does nothing special: it just reads all the data, renames some columns, removes a few others, and sets data types on columns. Here’s what the output looks like:

[Image: the output of the #”Price Paid” query]
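
To give you an idea of the kind of query I mean, here’s a minimal sketch of what it might look like – the file path, the source column names and the data types below are just placeholders rather than the ones I actually used:

let
    // Illustrative file path - point this at your own copy of the Price Paid csv file
    Source = Csv.Document(
        File.Contents("C:\Data\pp-2017.csv"),
        [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.Csv]),
    // Assume the first row contains column headers (the real file may not)
    PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    // Rename a couple of columns to friendlier names (placeholder names)
    RenamedColumns = Table.RenameColumns(
        PromotedHeaders,
        {{"Price", "Price Paid"}, {"Type", "Property Type"}}),
    // Remove a column that is not needed (placeholder name again)
    RemovedColumns = Table.RemoveColumns(RenamedColumns, {"Record Status"}),
    // Set data types on the columns used later in this post
    ChangedTypes = Table.TransformColumnTypes(
        RemovedColumns,
        {{"Price Paid", Int64.Type}, {"Property Type", type text}})
in
    ChangedTypes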

Second, you have a query called #”Property Types” that reads data from a very small Excel file and returns the following table:

[Image: the output of the #”Property Types” query]

As you can see, the Property Type column from the #”Price Paid” query contains single letter codes describing the type of property sold in each transaction; the Property Type column from #“Property Types” contains a distinct list of the same codes and acts as a dimension table. Again there’s nothing interesting going on in this query.
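
Once again, a minimal sketch of what a query like this might look like is shown below – the workbook path, the sheet name and the Description column are all placeholders rather than the real names:

let
    // Illustrative workbook path and sheet name
    Source = Excel.Workbook(File.Contents("C:\Data\Property Types.xlsx"), true),
    // Navigate to the sheet containing the small lookup table
    PropertyTypesSheet = Source{[Item = "PropertyTypes", Kind = "Sheet"]}[Data],
    // Set data types on the columns (Description is a placeholder column name)
    ChangedTypes = Table.TransformColumnTypes(
        PropertyTypesSheet,
        {{"Property Type", type text}, {"Description", type text}})
in
    ChangedTypes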

The problems start when you try to join data from these two queries using a Merge and then, for each row in #”Property Types”, show the sum of the Price Paid column from #”Price Paid”. The steps to do this are:

1. Click on the Merge Queries/Merge Queries as New button on the Home tab in the Power Query Editor:

[Image: the Merge Queries button on the Home tab of the Power Query Editor]

2. In the Merge dialog, do a left outer join between #”Property Types” and #”Price Paid” on the Property Type column of each table

[Image: the Merge dialog]

3. Back in the Power Query Editor window, click on the Expand/Aggregate button in the top right-hand corner of the column created by the previous step.

[Image: the Expand/Aggregate button]

4. Click on the Aggregate radio button and select Sum of Price Paid

[Image: the Expand/Aggregate pane with the Aggregate option and Sum of Price Paid selected]

The output of the query is this table:

[Image: the output of the query]

Here’s the M code for this query, as generated by the UI:

let
    Source =
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            Source,
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

It’s a very common thing to do, and in this case the query is extremely slow to run: on my laptop, when you refresh the preview in the Power Query Editor window in Power BI Desktop it takes 55 seconds. What’s more, in the bottom left-hand corner of the screen where it displays how much data is being read from the source file, it shows a value that is a lot more than the size of the source file. In fact it seems like the source file for #”Price Paid” is read once for each row in the #”Property Types” query: 140MB * 5 rows = 700MB.

[Image: the amount of data read from the source file, shown in the bottom left-hand corner of the Power Query Editor]

Not good. But what if you specify that the Property Type column from the #”Property Types” query is the primary key of the table? After all, it does contain unique values. Although it isn’t widely known, and it isn’t exposed anywhere in the UI, tables in M can have primary and foreign keys defined on them whatever data source you use (remember that in this case the data sources are csv and Excel). One way to do this is with the Table.AddKey function, as follows:

let
    WithAddedKey =
        Table.AddKey(
            #"Property Types",
            {"Property Type"},
            true),
    Source =
        Table.NestedJoin(
            WithAddedKey,
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            Source,
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

And guess what? After making this change the query takes only 12 seconds to run instead of 55! What’s more, the amount of data read from disk shown in the UI suggests that the source file for #”Price Paid” is now only read once. But making this change involves writing M code, and not everyone is comfortable making changes in the Advanced Editor. The good news is that it’s possible to get the same performance benefit in another way, without writing any M code.

As I mentioned in this blog post, using the Remove Duplicates transformation on a column has the side-effect of marking that column as a primary key. Therefore, if you right-click on the Property Type column and select Remove Duplicates between steps 2 and 3 above, before you click on the Expand/Aggregate button:

[Image: the Remove Duplicates option on the right-click menu for the Property Type column]

…then you also get the performance benefit. Here’s the M code generated by the UI for this query:

let
    Source =
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Removed Duplicates" =
        Table.Distinct(
            Source,
            {"Property Type"}),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            #"Removed Duplicates",
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

It’s #”Removed Duplicates” that is the new step here, and it uses the Table.Distinct function.
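
Incidentally, if you want to check whether a table actually has a primary key defined on it you can use the Table.Keys function, which returns a list of the keys on a table. A quick test query like the sketch below should return a key record with Primary set to true for the Property Type column after the Remove Duplicates step:

let
    // Removing duplicates on the Property Type column...
    Deduplicated = Table.Distinct(#"Property Types", {"Property Type"}),
    // ...marks that column as a primary key, which Table.Keys can confirm:
    // each item in the list it returns is a record with Columns and Primary fields
    KeysOnTable = Table.Keys(Deduplicated)
in
    KeysOnTable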

All this suggests that if you are doing merge operations on queries that get data from large csv or Excel files and then aggregating the results, setting a primary key on one of the columns used in the join can make a massive difference to performance. However this is just one example, and I would be really interested to hear whether any of you can reproduce these findings with your own data and queries – if you can, please leave a comment. I suspect this could be a very important discovery for M query performance tuning!

24 thoughts on “Improving The Performance Of Aggregation After A Merge In Power BI And Excel Power Query/Get&Transform”

  1. What an interesting find, Chris. When I read the article title, my initial guess was that adding the primary key on the larger table would help, but it’s the other way around. What this also shows is that Power Query still has no proper way of identifying dependencies between data sources. It seems it is iterating over ‘Property Types’ and loading the join table again for every row. I find it strange that the engine does not seem to buffer the file in memory and load from there. Or, even better, reverse the order of execution and first do the aggregation on the key column, and then do the join.
    I wonder if the same happens when using for example SQL as the right join table, even with query folding…

    • Interestingly, when I tried to use Table.Buffer (before I set the primary key) it actually made performance worse. To be honest, though, this post raises more questions than it answers, and I think there are more discoveries to be made in this area.

      • LOL! I was on an email thread with Imke Feldman the other day and her comment was “performance tuning is more art than science”. It’s frustratingly true right now. There have to be rules around this, but trying to work them out is painful!

  2. Hey Chris, this is really interesting. Do you happen to know if this is only for aggregation, or does it also improve speed if it’s a flat out expansion without aggregation?

  3. I had exactly this situation today, while trying to aggregate some values from the many-side table. Many ERPs have a data structure like this; mine was coming from MS Navision. I am wondering whether selecting more than two columns and then making them unique has the same effect, because the PK is sometimes a combination of columns, not only one PK column. Of course they can be combined into a single column. I’ll check this technique. Thanks for the information, I was not aware of this.

  4. Thanks Chris for this blog post. I have noticed this issue in the past but I’ve not tried to use Table.AddKey, so thank you for showing this possibility.
    I’ve solved this problem in a different way.

    let
        Source = Table.NestedJoin(
            PropertyTypes,
            {"Property Type"},
            Table.Group(
                #"Price Paid"[[Price Paid],[Property Type]],
                {"Property Type"},
                {{"price", each List.Sum([Price Paid])}}),
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
        ExpandedPrice = Table.ExpandTableColumn(Source, "Price Paid", {"price"}, {"price"})
    in
        ExpandedPrice

    It works without a primary key and seems to be even a little faster than your way (with Table.AddKey)… please try it.
    Performance with and without the primary key looks the same or very similar, and memory usage is at a proper level (164 MB).
    So, probably we have a problem with the algorithm of the Table.AggregateTableColumn function.
    I think there’s much more to discover in PQ :-))
    Anyway, once again, thank you for this article.

  5. Fantastic find. I’m noting this as a ‘must implement’ and am going back to revisit previous solutions where I’m sure this can be implemented. I’d have been happy to amend the M code but the Remove Duplicates trick makes it even easier, thanks!

  6. Thank you for sharing this technique, it’s useful.

    May I ask an unrelated question here about the 2nd argument of List.Sort? According to the official documentation, an optional comparison criterion can be used to sort the list; the following is extracted from the official website:

    “…To completely control the comparison, a function with two Arguments can be used that returns -1, 0, or 1 given the relationship between the left and right inputs. Value.Compare is a method that can be used to delegate this logic….”

    I’ll take list {1..9} for example:

    1. = List.Sort({1..9},(x,y)=>Value.Compare(1/x,1/y)) equals to {9,8,7,6,5,4,3,2,1}

    2.= List.Sort({1..9},(x,y)=>Value.Compare(y,x)) equals to {9,8,7,6,5,4,3,2,1}

    3.= List.Sort({1..9},(x,y)=>Value.Compare(y,y)) equals to {7,6,9,8,5,2,1,4,3}
    = List.Sort({1..9},each 1) equals to {7,6,9,8,5,2,1,4,3}
    = List.Sort({1..9},each _>9) equals to {7,6,9,8,5,2,1,4,3}

    = List.Sort({1..9},each _>5) equals to {4,5,3,1,2,8,9,6,7}

    4.= List.Sort({1..9},(x,y)=>Value.Compare(x,y-5)) equals to {7,8,6,1,5,2,9,3,4}

    5.= List.Sort({1..9},(x,y)=>Value.Compare(x,5-y)) equals to {1,7,6,8,5,2,4,9,3}

    We can write many other comparison criteria, sometimes we got the same result, sometimes not…This is interesting and also kinda strange. I just cannot figure out how exactly it works.

    So, do you happen to know what kind of algorithm it uses? I’m looking forward to your kind help.

    Thank you for your time.

    B.rgds,
    John Bian

  7. My bet would be that the merge operation identifies the left part of a left outer join as the many side of the query (which makes sense – one usually left joins to a fact table). When you then select to perform a summary operation on the right-side table (thereby identifying it as a fact table), the engine probably assesses the join as a many-to-many and therefore performs a Cartesian join. When you declare that the left-hand side table is distinct on the join field, the engine probably then reverts to treating it like a one-to-many again. I wonder if an inner join or right join (probably more appropriate for the scenario anyway, no?) would behave differently. Unfortunately I won’t be able to test this as my current environment is non-MS.

    • Nimrod, I think you’re 100% right here. Power Query does perform a Cartesian join by default, and to my knowledge does this for all the join types (although I could be corrected there). It’s curious to me that the performance gains only seem to be on the aggregation though… I would have expected them to also impact a straight merge…

  8. Interesting. But I’m still scratching my head trying to think of a real-world scenario where I would want to denormalize data in PowerQuery, rather than simply create a relationship in PowerPivot.

    (Granted, you’re using this example because it nicely illustrates the performance gain should you choose PQ as your hammer for this particular nut. But given I’d probably never use said hammer on this nut, I’m wondering if there’s any other kind of nut I’m likely to need to crack that could be consumed more speedily knowing this…)
