I’ve just learned some really interesting new Power Query/M optimisation tricks! I’m going to write them up here as a series of worked examples rather than generalised patterns, because I’m still trying to understand them properly myself and I’m not yet sure exactly what lessons to draw from them. Instead, I hope they give you some new ideas to try when optimising your own queries – I do think they’ll be useful to a lot of people.
In this first part I’m going to set up the scenario and show you what I found out from my own experimentation. The really mind-blowing tricks shown to me by the ever-helpful Curt Hagenlocher of the Power Query dev team will be covered in part 2.
Let’s say you have a large csv file which contains a numeric column and you want to get the maximum value from that column. In this case I’m going to use the 2018 Price Paid data from the UK Land Registry available here. This csv file contains 1,021,215 rows, one for each property transaction in England and Wales in 2018; the second column in this file contains the price paid for the property, so the aim here is to get the maximum price paid across all property transactions in 2018.
You can build this query quickly and easily, and get excellent performance, with a few clicks in the UI. After connecting to the csv file and setting the data type on Column2 to Whole Number, all you need to do is select Column2, go to the Transform tab in the Power Query Editor window, click the Statistics button and then select Maximum from the dropdown menu:
This returns the number we’re looking for:
The query takes around 1.5 seconds to run (I used the technique I blogged about here to measure the duration). Here’s the M code for the query:
[sourcecode language='text' padlinenumbers='true']
let
    Source = Csv.Document(
              File.Contents("C:\Users\chwebb\Downloads\pp-2018.csv"),
              [Delimiter=",", Columns=16, Encoding=1252,
               QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = Table.TransformColumnTypes(
                       Source,
                       {{"Column2", Int64.Type}}),
    #"Calculated Maximum" = List.Max(#"Changed Type"[Column2])
in
    #"Calculated Maximum"
[/sourcecode]
I, of course, did not use this very efficient method when I first built my query. Instead I did the following: after loading the data I sorted the table by Column2 in descending order, then right-clicked the top cell in Column2 and selected Drill Down:
Here’s the resulting M code:
[sourcecode language='text' padlinenumbers='true']
let
    Source = Csv.Document(
              File.Contents("C:\Users\chwebb\Downloads\pp-2018.csv"),
              [Delimiter=",", Columns=16, Encoding=1252,
               QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = Table.TransformColumnTypes(
                       Source,
                       {{"Column2", Int64.Type}}),
    #"Sorted Rows" = Table.Sort(
                      #"Changed Type",
                      {{"Column2", Order.Descending}}),
    Column2 = #"Sorted Rows"{0}[Column2]
in
    Column2
[/sourcecode]
The performance of this query is much worse: 75 seconds, although this varies a lot. So I tried to work out what’s going on here and see if I could improve performance… and that’s when I started learning.
The variation in the amount of time taken to run the query made me think about memory usage and the 256MB container size limit (see this blog post for important background information) and sure enough, Resource Monitor showed that this query was hitting the 256MB limit – unsurprising because sorting is one of those transformations that requires a table to be loaded into memory completely (at least in the worst case – and this seemed like the worst case). Why not reduce the size of the table then? Since only Column2 is needed by the query output I removed all other columns in the table before doing the sort, resulting in the following M:
[sourcecode language='text' highlight='9,10,11']
let
    Source = Csv.Document(
              File.Contents("C:\Users\chwebb\Downloads\pp-2018.csv"),
              [Delimiter=",", Columns=16, Encoding=1252,
               QuoteStyle=QuoteStyle.None]),
    #"Changed Type" = Table.TransformColumnTypes(
                       Source,
                       {{"Column2", Int64.Type}}),
    #"Removed Other Columns" = Table.SelectColumns(
                                #"Changed Type",
                                {"Column2"}),
    #"Sorted Rows" = Table.Sort(
                      #"Removed Other Columns",
                      {{"Column2", Order.Descending}}),
    Column2 = #"Sorted Rows"{0}[Column2]
in
    Column2
[/sourcecode]
This reduced query execution time a lot – it still varied, but now it was in the range of 5 to 8 seconds.
This leads to the first important performance tuning tip: remove all unnecessary columns from your tables to reduce memory overhead, especially if you’re doing memory-intensive transformations such as sorts, merges, groupings, pivots or unpivots. Of course you should be removing all columns you don’t need for your reports anyway, but the point here is that:
- You should remove columns you don’t need in your dataset as early as possible in your queries
- In situations where you need to do more complex transformations in intermediate queries (i.e. queries that don’t load directly into the dataset but whose output is used by queries that do), remove all but the columns needed by the transformation
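To illustrate the second point, here’s a minimal sketch of an intermediate query that keeps only the columns a later step actually needs (the file path and column names here are hypothetical, not from the example above):

[sourcecode language='text']
let
    // Hypothetical wide source table with many columns
    Source = Csv.Document(
              File.Contents("C:\Data\Sales.csv"),
              [Delimiter=",", Encoding=1252,
               QuoteStyle=QuoteStyle.None]),
    // Keep only the key column and the value column that the
    // downstream transformation (eg a merge or group) will use,
    // before any memory-intensive work happens
    #"Removed Other Columns" = Table.SelectColumns(
                                Source,
                                {"ProductKey", "SalesAmount"})
in
    #"Removed Other Columns"
[/sourcecode]

The important thing is that the Table.SelectColumns step comes before, not after, any sort, merge, grouping, pivot or unpivot.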
Tune in next week for part 2, where I’ll show even more examples of how this particular query can be tuned – and I promise that all you Power Query fans will learn something new!
Great post! So, if I have several CSVs (with a lot of columns) in a folder and use Folder import, is the performance better if I eliminate the unneeded columns in the Sample (function) query instead of in the Final one?
I don’t know for sure, sorry – you’ll have to test it yourself, but I think removing the columns in the function is a good idea.
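For example, a sketch only (with hypothetical column names), showing how the column removal could go inside the function that processes each file, so it happens before the per-file tables are combined:

[sourcecode language='text']
// Hypothetical function applied to each file in the folder
(FileContents as binary) as table =>
let
    Source = Csv.Document(
              FileContents,
              [Delimiter=",", Encoding=1252,
               QuoteStyle=QuoteStyle.None]),
    // Remove unneeded columns here, before combining
    #"Removed Other Columns" = Table.SelectColumns(
                                Source,
                                {"Column1", "Column2"})
in
    #"Removed Other Columns"
[/sourcecode]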
Very nice. Thanks for your work. This sort of optimization will help when working with dataflows on Power BI Premium or capacity.
I don’t regret adding you to my list of morning readings. You’re resting right next to sqlbi, radacad & powerpivotpro. 🙂
Much needed section of Power Query optimizations! Looking forward to new posts, hoping one of them will also address the 256MB container size limit or can we vote an idea to make it a setting? 🙂
Hi Chris
Which of the below queries would perform better on large data sets?
Case 1: Removing Duplicates from a Column
Assuming D is a One Column Excel Table with a Column called Name
let
    Source = Excel.CurrentWorkbook(){[Name="D"]}[Content],
    mRemDup = Table.Distinct(Source)
in
    mRemDup
OR
let
    Source = Excel.CurrentWorkbook(){[Name="D"]}[Content],
    mRemDup = Table.Group(Source, {"Name"}, {})
in
    mRemDup
Case 2: Find items in One Table not present in the other
Consider 2 Single Column Tables called D and S each having a column called Name
let
    Source = Excel.CurrentWorkbook(){[Name="D"]}[Content],
    mMergeS = Table.NestedJoin(Source, {"Name"}, S, {"Name"}, "S", JoinKind.LeftOuter),
    mXpandS = Table.ExpandTableColumn(mMergeS, "S", {"Name"}, {"S.Name"}),
    mFilterOutNull = Table.SelectRows(mXpandS, each ([S.Name] = null)),
    mRemCols = Table.RemoveColumns(mFilterOutNull, {"S.Name"})
in
    mRemCols
OR
let
    Source = Excel.CurrentWorkbook(){[Name="D"]}[Content],
    mMergeS = Table.NestedJoin(Source, {"Name"}, S, {"Name"}, "S", JoinKind.LeftOuter),
    mFilterOutNull = Table.SelectRows(mMergeS, each Table.IsEmpty([S])),
    mRemCols = Table.RemoveColumns(mFilterOutNull, {"S"})
in
    mRemCols
OR
let
    Source = Excel.CurrentWorkbook(){[Name="D"]}[Content],
    mMergeS = Table.NestedJoin(Source, {"Name"}, S, {"Name"}, "S", JoinKind.LeftAnti),
    mRemCol = Table.RemoveColumns(mMergeS, {"S"})
in
    mRemCol
Case 3: When combining Data from Multiple Files in a Folder
let
    Source = Folder.Files(P{0}[Value]),
    mCombine = Table.Combine(Table.AddColumn(Source, "Custom", each fGetDataFile([Folder Path], [Name]))[Custom])
in
    mCombine
The advantage here is the Consolidation is independent of the Column names – but is it faster than the below approach?
OR
let
    Source = Folder.Files(P{0}[Value]),
    mAddCustCol = Table.AddColumn(Source, "Custom", each fGetDataFile([Folder Path], [Name])),
    mRemOthCols = Table.SelectColumns(mAddCustCol, {"Custom"}),
    mXpandCols = Table.ExpandTableColumn(mRemOthCols, "Custom", ColumNames)
in
    mXpandCols
I don’t know, and I don’t think anyone could know without testing. Sorry!
I wish the Power Query team could publish a set of guidelines to follow
Why load the unnecessary columns at all?
In TransformColumnTypes, only reference the needed columns.
“You should remove columns you don’t need in your dataset as early as possible in your queries”
You might need to grab information from other columns before deleting them. Or have a nested table column that you want to grab stuff from before deleting them. And many, many other reasons!
Thanks for this post Chris, this helped me tonight! 🙂