Power Query/M Optimisation: Getting The Maximum Value From A Column, Part 2

In part 1 of this series – which I strongly recommend you read before reading this post – I showed how removing columns from a table can make a dramatic improvement to the performance of certain transformations in Power Query. In this post I’ll show some tricks taught to me by Curt Hagenlocher of the dev team that can improve performance even more.

First of all, let me remind you of my original, totally unoptimised query which takes over a minute to run:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"{0}[Column2]
in
    Column2

In particular, I’d like to draw your attention to the last step where the position and lookup operators (I blogged about them here) are used to return the value in the Column2 column of the first line of the sorted table:

#"Sorted Rows"{0}[Column2]

Breaking this expression down:

  • #"Sorted Rows" returns the sorted table from the previous step in the query
  • {0} returns the first line of that table as a record
  • [Column2] gives the value in the Column2 field from that record
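
To make that breakdown concrete, here is a minimal sketch that splits the expression into separate steps. The three-row table built with #table is a hypothetical stand-in for the sorted table, not the real data, and the step names are my own:

let
    // Hypothetical stand-in for the sorted table, already in descending order
    Sorted = 
    #table(
        {"Column1", "Column2"},
        {{"a", 300}, {"b", 200}, {"c", 100}}
        ),
    // {0} returns the first row of the table as a record
    FirstRow = 
    Sorted{0},
    // [Column2] returns the value of the Column2 field in that record
    Value = 
    FirstRow[Column2]
in
    Value

Here FirstRow is the record [Column1="a", Column2=300] and the query returns 300.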

It turns out that changing the order of {0} and [Column2] in this expression has a big impact on performance. So using the following expression in the last step instead:

#"Sorted Rows"[Column2]{0}

…reduces query execution time to 5-7 seconds! I’m told this is because asking for the column first does the same thing internally as manually removing all other columns in the table which, as I showed in my last post, has a dramatic impact on performance.
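
For reference, this is roughly what the whole query looks like with that change, assuming every other step is left exactly as it was in the original query and only the last step is edited:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    // Column first, then row: [Column2] returns the column as a list,
    // and {0} returns the first item in that list
    Column2 = 
    #"Sorted Rows"[Column2]{0}
in
    Column2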

It’s important to understand that this is not a general recommendation to request the column before the row in expressions like this, because in other scenarios requesting the row before the column might perform better. What you need to do is test writing expressions like this both ways to see what gives you the best performance.

One piece of general advice that Curt did give me, though, was that most of the optimisations that the Power Query engine can use only happen on tables – not records or lists – so you should always try to work with tables whenever you can. For this particular query, using the Table.FirstN function to get the first row of the table like so:

Table.FirstN(#"Sorted Rows",1)[Column2]{0}

…allows for another internal optimisation to kick in, taking the query execution time down to around 2 seconds – the same performance as the original query in my previous post.
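
Putting that into the full query gives something like the following sketch, again assuming the earlier steps are unchanged from the original query and only the last step differs:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    // Table.FirstN keeps the result as a one-row table,
    // then [Column2]{0} takes the single value out of it
    Column2 = 
    Table.FirstN(#"Sorted Rows",1)[Column2]{0}
in
    Column2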

Power Query/M Optimisation: Getting The Maximum Value From A Column, Part 1

I’ve just learned some really interesting new Power Query/M optimisation tricks! I’m going to write them up here as a series of worked examples rather than generalised patterns because I’m still trying to understand them properly myself and I’m not really sure what lessons to draw from them exactly; instead, I hope they give you some new ideas to try when optimising your own queries. I do think they will be useful to a lot of people though.

In this first part I’m going to set up the scenario and show you what I found out from my own experimentation. The really mind-blowing tricks shown to me by the ever-helpful Curt Hagenlocher of the Power Query dev team will be covered in part 2.

Let’s say you have a large csv file which contains a numeric column and you want to get the maximum value from that column. In this case I’m going to use the 2018 Price Paid data from the UK Land Registry available here. This csv file contains 1021215 rows, one for each property transaction in England and Wales in 2018; the second column in this file contains the price paid for the property, so the aim here is to get the maximum price paid for all property transactions in 2018.

You can build this query quickly and easily, and get excellent performance, with a few clicks in the UI. After connecting to the csv file and setting the data type on Column2 to Whole Number, all you need to do is select Column2, go to the Transform tab in the Power Query Editor window, click Statistics and then select Maximum from the dropdown menu:

[Screenshot: the Statistics dropdown on the Transform tab, with Maximum selected]

This returns the number we’re looking for:

[Screenshot: the maximum value returned by the query]

The query takes around 1.5 seconds to run (I used the technique I blogged about here to measure the duration). Here’s the M code for the query:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Calculated Maximum" = 
    List.Max(#"Changed Type"[Column2])
in
    #"Calculated Maximum"

I, of course, did not use this very efficient method when I first built my query. Instead I did the following: after loading the data I sorted the table by Column2 in descending order and then right-clicked in the top cell in Column2 and selected Drill Down:

[Screenshot: the Drill Down option on the right-click menu for the top cell in Column2]

Here’s the resulting M code:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Changed Type",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"{0}[Column2]
in
    Column2

The performance of this query is much worse: 75 seconds, although this varies a lot. So I tried to work out what’s going on here and see if I could improve performance… and that’s when I started learning.

The variation in the amount of time taken to run the query made me think about memory usage and the 256MB container size limit (see this blog post for important background information). Sure enough, Resource Monitor showed that this query was hitting the 256MB limit, which is unsurprising because sorting is one of those transformations that requires a table to be loaded into memory completely (at least in the worst case – and this seemed like the worst case). Why not reduce the size of the table then? Since only Column2 is needed by the query output, I removed all other columns in the table before doing the sort, resulting in the following M:

let
    Source = 
    Csv.Document(
        File.Contents(
            "C:\Users\chwebb\Downloads\pp-2018.csv"
            ),
            [Delimiter=",", Columns=16, 
            Encoding=1252, QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = 
    Table.TransformColumnTypes(
        Source,
        {{"Column2", Int64.Type}}
        ),
    #"Removed Other Columns" = 
    Table.SelectColumns(
        #"Changed Type",
        {"Column2"}
        ),
    #"Sorted Rows" = 
    Table.Sort(
        #"Removed Other Columns",
        {{"Column2", Order.Descending}}
        ),
    Column2 = 
    #"Sorted Rows"{0}[Column2]
in
    Column2

This reduced query execution time a lot – it still varied, but now it was in the range of 5 to 8 seconds.

This leads to the first important performance tuning tip: remove all unnecessary columns from your tables to reduce memory overhead, especially if you’re doing memory-intensive transformations such as sorts, merges, groupings, pivots or unpivots. Of course you should be removing all columns you don’t need for your reports anyway, but the point here is that:

  • You should remove columns you don’t need in your dataset as early as possible in your queries
  • In situations where you need to do more complex transformations in intermediate queries (i.e. queries that don’t load directly into the dataset but whose output is used by queries that do), remove all but the columns needed by the transformation

Tune in next week for part 2, where I’ll show even more examples of how this particular query can be tuned – and I promise that all you Power Query fans will learn something new!
