Improving The Performance Of Aggregation After A Merge In Power BI And Excel Power Query/Get&Transform

A long time ago someone – probably from the Power Query dev team – told me that adding primary keys to M tables could improve the performance of certain transformations in Power BI and Excel Power Query/Get&Transform. Indeed, I mentioned this in chapter 5 of my book “Power Query for Power BI and Excel”, but it wasn’t until this week that I found a scenario where this was actually the case. In this post I’ll describe the scenario and try to draw some conclusions about when adding primary keys to tables might make a difference to performance.

Imagine you have two queries that return data from two different files. First of all there’s a query called #”Price Paid”, which reads data from a 140MB csv file containing some of my favourite demo data: all 836,933 property transactions in the year 2017 from the UK Land Registry Price Paid dataset. The query does nothing special: it just reads all the data, renames some columns, removes a few others, and sets data types on columns. Here’s what the output looks like:
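For reference, the query itself is nothing more sophisticated than something like this. This is only a minimal sketch: the file path and the raw column names used here are hypothetical, and the real query renames and removes more columns than shown.

let
    // Hypothetical path to the Land Registry Price Paid csv file
    Source = Csv.Document(File.Contents("C:\Demo\pp-2017.csv"), [Delimiter = ",", Encoding = 65001]),
    // The raw file has no headers, so rename the columns used later in this post (illustrative names only)
    Renamed = Table.RenameColumns(Source, {{"Column2", "Price Paid"}, {"Column5", "Property Type"}}),
    // Keep only the columns needed for the example
    Removed = Table.SelectColumns(Renamed, {"Price Paid", "Property Type"}),
    // Set data types on the remaining columns
    Typed = Table.TransformColumnTypes(Removed, {{"Price Paid", Int64.Type}, {"Property Type", type text}})
in
    Typed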

Second, you have a query called #”Property Types” that reads data from a very small Excel file and returns the following table:

As you can see, the Property Type column from the #”Price Paid” query contains single-letter codes describing the type of property sold in each transaction; the Property Type column from #”Property Types” contains a distinct list of the same codes, so #”Property Types” acts as a dimension table. Again, there’s nothing interesting going on in this query.
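For completeness, the #”Property Types” query might look something like the following sketch; the workbook path, sheet name and description column name are all assumptions:

let
    // Hypothetical path to the small Excel lookup workbook
    Source = Excel.Workbook(File.Contents("C:\Demo\PropertyTypes.xlsx"), true),
    // Navigate to the sheet containing the lookup table (sheet name is an assumption)
    PropertyTypesSheet = Source{[Item = "PropertyTypes", Kind = "Sheet"]}[Data],
    // Set data types on the two columns
    Typed = Table.TransformColumnTypes(
        PropertyTypesSheet,
        {{"Property Type", type text}, {"Property Type Description", type text}})
in
    Typed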

The problems start when you try to join data from these two queries using a Merge and then, for each row in #”Property Types”, show the sum of the Price Paid column from #”Price Paid”. The steps to do this are:

1. Click on the Merge Queries/Merge Queries as New button on the Home tab in the Power Query Editor:

2. In the Merge dialog, do a left outer join between #”Property Types” and #”Price Paid” on the Property Type column on each table

3. Back in the Power Query Editor window, click on the Expand/Aggregate button in the top right-hand corner of the column created by the previous step.

4. Click on the Aggregate radio button and select Sum of Price Paid

The output of the query is this table:

Here’s the M code for this query, as generated by the UI:

let
    Source =
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            Source,
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

It’s a very common thing to do, and in this case the query is extremely slow to run: on my laptop, when you refresh the preview in the Power Query Editor window in Power BI Desktop it takes 55 seconds. What’s more, in the bottom left-hand corner of the screen where it displays how much data is being read from the source file, it shows a value that is a lot more than the size of the source file. In fact it seems like the source file for #”Price Paid” is read once for each row in the #”Property Types” query: 140MB * 5 rows = 700MB.

Not good. But what if you specify that the Property Type column from the #”Property Types” query is the primary key of the table? After all, it does contain unique values. Although it isn’t widely known, and it’s not shown in the UI, tables in M can have primary and foreign keys defined on them whatever data source you use (remember that in this case the data sources are csv and Excel). One way to do this is using Table.AddKey, as follows:

let
    WithAddedKey =
        Table.AddKey(
            #"Property Types",
            {"Property Type"},
            true),
    Source =
        Table.NestedJoin(
            WithAddedKey,
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            Source,
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

And guess what? After making this change, the query only takes 12 seconds to run instead of 55 seconds! What’s more, the amount of data read from disk shown in the UI suggests that the source file for #”Price Paid” is only read once. But making this change involves writing M code, and not everyone is comfortable making changes in the Advanced Editor. The good news is that it’s possible to get this performance benefit in another way, without writing any M code.

As I mentioned in this blog post, using the Remove Duplicates transformation on a column has the side-effect of marking that column as a primary key. Therefore if you right-click on the Property Type column and select Remove Duplicates in between steps 2 and 3 above, before you click on the Expand/Aggregate button:

…then you also get the performance benefit. Here’s the M code generated by the UI for this query:

let
    Source =
        Table.NestedJoin(
            #"Property Types",
            {"Property Type"},
            #"Price Paid",
            {"Property Type"},
            "Price Paid",
            JoinKind.LeftOuter),
    #"Removed Duplicates" =
        Table.Distinct(
            Source,
            {"Property Type"}),
    #"Aggregated Price Paid" =
        Table.AggregateTableColumn(
            #"Removed Duplicates",
            "Price Paid",
            {{"Price Paid", List.Sum, "Sum of Price Paid"}})
in
    #"Aggregated Price Paid"

It’s #”Removed Duplicates” that is the new step here, and it uses the Table.Distinct function.
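If you want to confirm that a primary key really has been set by either approach, the Table.Keys function returns the keys defined on a table. Here’s a minimal sketch (the exact shape of the returned values may differ):

let
    WithAddedKey = Table.AddKey(#"Property Types", {"Property Type"}, true),
    // Table.Keys returns a list of records describing the keys on a table,
    // for example {[Columns = {"Property Type"}, Primary = true]}
    KeysOnTable = Table.Keys(WithAddedKey)
in
    KeysOnTable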

All this suggests that if you are doing a merge on queries that get data from large csv or Excel files and then aggregating the results, setting a primary key on one of the columns used in the join can make a massive difference to performance. However this is just one example, and I would be really interested to hear whether any of you can reproduce these findings with your own data and queries – if you can, please leave a comment. I suspect this could be a very important discovery for M query performance tuning!

45 thoughts on “Improving The Performance Of Aggregation After A Merge In Power BI And Excel Power Query/Get&Transform”

  1. What an interesting find, Chris. When I read the article title, my initial guess was that adding the primary keys on the larger table would help, but it’s the other way around. What this also shows is that Power Query still has no proper way of identifying dependencies between data sources. It seems it is iterating over ‘Property Types’ and loading the join table again for every row. I find it strange that the engine does not seem to buffer the file in memory and load from there. Or, even better, reverse the order of execution and first do the aggregation on the key column, and then do the join.
    I wonder if the same happens when using for example SQL as the right join table, even with query folding…

    1. Chris Webb says:

      Interestingly, when I tried to use Table.Buffer (before I set the primary key) it actually made performance worse. To be honest, though, this post raises more questions than it answers, and I think there are more discoveries to be made in this area.

      1. LOL! I was on an email thread with Imke Feldman the other day and her comment was “performance tuning is more art than science”. It’s frustratingly true right now. There have to be rules around this, but trying to work them out is painful!

  2. Hey Chris, this is really interesting. Do you happen to know if this is only for aggregation, or does it also improve speed if it’s a flat out expansion without aggregation?

    1. Chris Webb says:

      As far as I can see it’s only for aggregation – I didn’t notice any difference for expansion, but I will do some proper testing.

      1. Chris H says:

        Looking forward to your results Chris! Please consider updating this blog with your findings?

      2. Chris Webb says:

        That’s fast too

  3. I had exactly this situation today, while trying to aggregate some values from a many-side table. Many ERPs have a data structure like this; mine was coming from MS Navision. I am wondering whether selecting more than two columns and then making them unique has the same effect, because the PK is sometimes a combination of columns, not only one PK column. Of course they can be combined into a single column (see the sketch after the replies below). I’ll check this technique. Thanks for the information, I was not aware of this.

    1. Bradley Sawler says:

      Thanks heaps, Chris, for this post. I read with enthusiasm until I read Ken’s comment, as I generally expand without aggregation, so that dampened my spirits. I ran my query and stopped it after 10 mins as it was still loading and I couldn’t be bothered waiting any longer. After I added the primary key it loaded in 2.5 minutes, so it was at least 4x quicker!

    2. I just performed a two-column merge and I can vouch for the fact that my aggregation was much, MUCH faster after removing duplicates from those two columns (even though there were none to begin with).
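     Here is a minimal sketch of the composite-key idea raised in the comment above, with hypothetical query and column names: both Table.AddKey and Table.Distinct accept a list of columns, so the key does not have to be a single column.

     let
         // Hypothetical dimension and fact query names with a two-column business key
         WithCompositeKey =
             Table.AddKey(
                 #"Item Dimension",
                 {"Company Code", "Item No"},   // composite primary key columns (hypothetical)
                 true),
         Merged =
             Table.NestedJoin(
                 WithCompositeKey,
                 {"Company Code", "Item No"},
                 #"Item Ledger Entries",
                 {"Company Code", "Item No"},
                 "Item Ledger Entries",
                 JoinKind.LeftOuter),
         #"Aggregated Amount" =
             Table.AggregateTableColumn(
                 Merged,
                 "Item Ledger Entries",
                 {{"Amount", List.Sum, "Sum of Amount"}})
     in
         #"Aggregated Amount"

     Alternatively, based on the reply above, doing Remove Duplicates over the same set of columns (Table.Distinct with a list of columns) appears to have the same effect without any hand-written M.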

  4. Thanks Chris for this blog post. I have noticed this issue in the past but I’ve not tried to use Table.AddKey, so thank you for showing this possibility.
    I’ve solved this problem in a different way.

    let
        Source =
            Table.NestedJoin(
                #"Property Types",
                {"Property Type"},
                Table.Group(
                    #"Price Paid"[[Price Paid], [Property Type]],
                    {"Property Type"},
                    {{"price", each List.Sum([Price Paid])}}),
                {"Property Type"},
                "Price Paid",
                JoinKind.LeftOuter),
        ExpandedPrice = Table.ExpandTableColumn(Source, "Price Paid", {"price"}, {"price"})
    in
        ExpandedPrice

    It works without a primary key and seems to be even a little faster than your way (with Table.AddKey)… please, try.
    Performance with and without the primary key looks the same or very similar, and memory usage is at a proper level (164 MB).
    So, probably we have a problem with the algorithm of the Table.AggregateTableColumn function.
    I think there’s much more to discover in PQ :-))
    Anyway, once again, thank you for this article.

    1. Chris Webb says:

      Thanks Bill – that’s very interesting. On my machine performance seems to be roughly the same with both techniques, but I have to admit I was only timing the queries with my watch. I agree we all have a lot more to learn about the internals of Power Query!

  5. Fantastic find. I’m noting this as a ‘must implement’ and am going back to revisit previous solutions where I’m sure this can be implemented. I’d have been happy to amend the M code but the Remove Duplicates trick makes it even easier, thanks!

  6. Thank you for sharing this technique, it’s useful.

    May I ask an unrelated question here about the 2nd argument of List.Sort? According to the official documentation, an optional comparison criterion can be used to sort the list; the following is extracted from the official website:

    “…To completely control the comparison, a function with two Arguments can be used that returns -1, 0, or 1 given the relationship between the left and right inputs. Value.Compare is a method that can be used to delegate this logic….”

    I’ll take the list {1..9} as an example:

    1. = List.Sort({1..9},(x,y)=>Value.Compare(1/x,1/y)) returns {9,8,7,6,5,4,3,2,1}

    2. = List.Sort({1..9},(x,y)=>Value.Compare(y,x)) returns {9,8,7,6,5,4,3,2,1}

    3. = List.Sort({1..9},(x,y)=>Value.Compare(y,y)) returns {7,6,9,8,5,2,1,4,3}
    = List.Sort({1..9},each 1) returns {7,6,9,8,5,2,1,4,3}
    = List.Sort({1..9},each _>9) returns {7,6,9,8,5,2,1,4,3}

    = List.Sort({1..9},each _>5) returns {4,5,3,1,2,8,9,6,7}

    4. = List.Sort({1..9},(x,y)=>Value.Compare(x,y-5)) returns {7,8,6,1,5,2,9,3,4}

    5. = List.Sort({1..9},(x,y)=>Value.Compare(x,5-y)) returns {1,7,6,8,5,2,4,9,3}

    We can write many other comparison criteria; sometimes we get the same result, sometimes not… This is interesting and also kinda strange. I just cannot figure out how exactly it works.

    So, do you happen to know what kind of algorithm it uses? I’m looking forward to your kind help.

    Thank you for your time.

    B.rgds,
    John Bian

  7. My bet would be that the merge operation identifies the left part of a left outer join as the many side of the query (which makes sense – one usually left joins to a fact table). When you then select to perform a summary operation on the right side table (thereby identifying it as a fact table) the engine probably assesses the join as a many-to-many and therefore performs a Cartesian join. When you declare that the left hand side table is distinct on the join field, the engine probably then reverts to treating it like a one-to-many again. I wonder if an inner join or right join (probably more appropriate for the scenario anyway, no?) would behave differently. Unfortunately I won’t be able to test this as my current environment is non-MS.

    1. Nimrod, I think you’re 100% right here. Power Query does perform a Cartesian join by default, and to my knowledge does this for all the join types (although I could be corrected there). It’s curious to me that the performance gains only seem to be on the aggregation though… I would have expected them to also impact a straight merge…

  8. Interesting. But I’m still scratching my head trying to think of a real-world scenario where I would want to denormalize data in Power Query, rather than simply create a relationship in Power Pivot.

    (Granted, you’re using this example because it nicely illustrates the performance gain should you choose PQ as your hammer for this particular nut. But given I’d probably never use said hammer on this nut, I’m wondering if there’s any other kind of nut I’m likely to need to crack that could be consumed more speedily knowing this…)

  9. Great information as always, Chris. One thing though: I could see that you defined the keys within the merging query (let’s call it #”Aggregated Price Paid”). Does this mean that the keys are only defined within the scope of that query?
    Reading the Microsoft documentation for Table.AddKey, they do this as the last step for the joined table (in this case it would have been the #”Property Types” table). Any thoughts about performance and allocated memory differences between these two methods?

    Is this to be seen as an index in SQL? Then maybe a good piece of advice would be to create these indexes as the very last step of every query that will be used in a heavy join-like operation, like below:

    DIM1:
    let
        table = ….,
        tableWithPK = Table.AddKey(table, {"PK"}, true),
        tableWithFK = Table.AddKey(tableWithPK, {"FK1", "FK2"}, false)

        //maybe the line above could also look like this
        //tableWithFK = Table.AddKey(table, {"FK1", "FK2"}, false)
    in
        tableWithFK

    If merging within the same query, maybe it would make sense to temporarily create keys on the temporary table like below:

    FACT1:
    let
        source = …,
        tempkeys = Table.AddKey(source, {"FK"}, false),
        mergedTable = Table.NestedJoin…,
        expandedTable = Table.ExpandTableColumn…
    in
        expandedTable

  10. No Key on primary tables is slightly faster.

    I have 3 large tables to join, but only need a small subset of data from all three.

    I saved one file with keys on the large tables.
    A second file with no keys on the large tables.

    For each – I opened the file, and refreshed.

    The Key file took ~85 seconds on first refresh after opening.

    The No-Key file took ~81 seconds on first refresh after opening.

    Note: I tested only first refresh because that’s how my users will use this. Open, take data, then close.
    Note: Two of my large tables are combinations of ~8 excel workbooks each. My third table is all one excel file.
    Note: Obviously, results are within the margin of error. Your Mileage May Vary.

  11. vlookup(…false) only takes less than 10 seconds for tables of 200,000 rows, and is much simpler, except for the need to do it in chunks. I’ve implemented this in xlVlookup using binary search and Application.Index in 65,536-row chunks (due to the index dll not being updated yet?). So it appears xlVlookup is fastest?

  12. Thank you for this Chris. It has just helped me to reduce the data load time from 6 mins to 2 mins on a model that I am working on.

  13. Thanks for this blog Chris. But I can’t help but ask: why is it acceptable that Power Query is so slow doing aggregations when grouping is so much faster? Other languages don’t struggle as much as Power Query M. I try to do as much as possible in R or Python to avoid such bad performance.

  14. I find this very aggravating. I feel like I spend far too much of my day watching this: https://www.screencast.com/t/s1H3jeHguS

    I’m currently working on a project that is transforming and joining data between a few different worksheets. Joins in a couple of the queries are unbearably slow. I’ve tried a number of solutions to make certain joins faster, including things like 1. Adding these keys 2. Paring down columns and rows as much as possible 3. Creating concatenated composite columns when needing to join on more than one column 4. Table.Buffer beforehand on either/both tables for the join 5. Clearing the query cache 6. Turning on/off “Fast Data Load” and “Allow data preview to download in the background” 7. And whatever else has come to mind that day.

    And yet, I sit here waiting for a 1k-row table to be joined to an 800-row table and it’s taking forever on my dual 6-core i7 machine with 48GB of RAM.

    1. Binary search followed by Index appears to be very fast for all Excel table joins: only a few seconds for up to 300,000 rows by dozens of columns. I can share my code if you show me how to put it up on GitHub.

  15. I had 3 large merged tables joining fields in 6 tables each. It was working until this past weekend, albeit slowly, when they changed the server/db for our updated ERP. I updated the data source settings easily. But then updating my pending queries kept timing out.

    Per your suggestion, I just added an index column to the 3 merged tables making sure it was the first column in each. All 2 dozen of my queries began loading within a few minutes.

    This was a huge find! Thanks for sharing this idea.

  16. Mr. Webb, you’ve just redeemed many days of my (past and future) work. Perhaps, due to my poor skills and knowledge, the help your suggestion brought me was the most extreme one among the peers here:
    I had to merge and aggregate two large tables, almost 900K rows vs 50K rows. The merge was based on three columns which together formed a unique key. I was watching the source file getting loaded FOREVER, and I don’t mean minutes or hours: after five days it said 250TB loaded from that 40MB csv file… and it would keep going, probably, as you said, reading the file again for each row. That’s when I gave up for the last time and knew it was due to my bad coding approach. While setting up a workaround, I ran into this post. And what news: it was done in about a minute…
    THANK YOU!
