Chris Webb's BI Blog: Power Query Performance And Expanding Columns

As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.

My original attempt to write an M query to convert a folder full of images to text looked something like this:

let
Source =
Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
SplitText =
(ImageBinaryText as text) =>
let
SplitTextFunction =
Splitter.SplitTextByRepeatedLengths(30000),
SplitUpText =
SplitTextFunction(ImageBinaryText)
in
SplitUpText,
#"Added Custom" =
Table.AddColumn(
Source,
"Pic1",
each SplitText(
Binary.ToText(
[Content],
BinaryEncoding.Base64
)
)
),
#"Expanded Pic" =
Table.ExpandListColumn(
#"Added Custom",
"Pic1"
)
in
#"Expanded Pic"

The approach I took was the one that seemed natural to me at the time:

Use the Folder data source to connect to the folder containing the image files
Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
Call the function once per row on the table returned by step (1)
Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:

Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned although I admit I didn’t confirm this.

I can’t say I knew what I was doing but I rewrote the query from scratch and came up with the code that I gave in the blog post which only reads from each file once (without using buffering too, I should point out). What’s the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is something we should avoid doing? More research is necessary, and, as always, I would be interested to hear about your experiences – it is after all a fairly common pattern.

6 thoughts on “Power Query Performance And Expanding Columns”

Hi
I haven’t been through exactly this situation, but one that is always a nightmate to me is when I need to run a “CountIf” at my query. I’ve got GigaBytes count in a 5 mb data source. Will keep an eye on it as well, I really need to work around it other than loading the query (before the countif) in excel, using excel to perform the countif, and then loading it back again to the model with the countif.

Hi Chris,

I’ve run into this often and have a theory. Lists undergo lazy evaluation and it appears that a list column adheres to this principle. It isn’t a column of lists, but rather a column of pointer code that tells Power Query how to go about getting the list when called. So when you expand the column, it can end up doing some odd things (depending on what generated the list).

Pingback: Power Query List Expansion Problems – Curated SQL

Chris, read through your blog(s) on factors that go in determining the performance of a power bi report like browser, client h/w, network latency etc. you have mentioned that you omitted data volumes as it has nothing to do with report loading time on browser. So does it mean that if we have 1 k records or 1 mil records in a import mode report using shared dataset, report visuals will take same time to appear? and it will be same for import mode reports without shared dataset? Also what about the environment – is it same in dev, qa and prod as its on power bi service (cloud) ?

Chris Webb says:

March 19, 2020 at 9:55 pm

Query performance rather than data volume is very important in determining report performance, but they are not always linked.

Loading...

Reply
1. Triparna Ray says:
  
  March 20, 2020 at 6:08 am
  
  Hi Chris, Thanks for your comments. Agree, query performance plays a bigger role in this aspect and there are endless ways to improve a report performance. I guess though we have the data handling limitations (ceilings) defined for power bi reports we don’t have figures to backup volume vs performance impact as such. But still its okay to factor it in though not the major one as you suggested.
  
  Loading...

Rafael Moreno says:

May 27, 2019 at 12:57 pm

Hi
I haven’t been through exactly this situation, but one that is always a nightmate to me is when I need to run a “CountIf” at my query. I’ve got GigaBytes count in a 5 mb data source. Will keep an eye on it as well, I really need to work around it other than loading the query (before the countif) in excel, using excel to perform the countif, and then loading it back again to the model with the countif.

Loading...

Jay Sumners says:

May 29, 2019 at 3:13 am

Hi Chris,

I’ve run into this often and have a theory. Lists undergo lazy evaluation and it appears that a list column adheres to this principle. It isn’t a column of lists, but rather a column of pointer code that tells Power Query how to go about getting the list when called. So when you expand the column, it can end up doing some odd things (depending on what generated the list).

Loading...

Pingback: Power Query List Expansion Problems – Curated SQL
Triparna Ray says:

March 19, 2020 at 1:32 pm

Chris, read through your blog(s) on factors that go in determining the performance of a power bi report like browser, client h/w, network latency etc. you have mentioned that you omitted data volumes as it has nothing to do with report loading time on browser. So does it mean that if we have 1 k records or 1 mil records in a import mode report using shared dataset, report visuals will take same time to appear? and it will be same for import mode reports without shared dataset? Also what about the environment – is it same in dev, qa and prod as its on power bi service (cloud) ?

Loading...

1. Chris Webb says:
  
  March 19, 2020 at 9:55 pm
  
  Query performance rather than data volume is very important in determining report performance, but they are not always linked.
  
  Loading...
  
  1. Triparna Ray says:
    
    March 20, 2020 at 6:08 am
    
    Hi Chris, Thanks for your comments. Agree, query performance plays a bigger role in this aspect and there are endless ways to improve a report performance. I guess though we have the data handling limitations (ceilings) defined for power bi reports we don’t have figures to backup volume vs performance impact as such. But still its okay to factor it in though not the major one as you suggested.
    
    Loading...

Power Query Performance And Expanding Columns

Like this:

Published by Chris Webb

6 thoughts on “Power Query Performance And Expanding Columns”

Leave a Reply to Jay SumnersCancel reply

Share this:

Like this:

Published by Chris Webb

6 thoughts on “Power Query Performance And Expanding Columns”

Leave a Reply to Jay SumnersCancel reply

Discover more from Chris Webb's BI Blog