Power Query Performance And Expanding Columns

Chris Webb M, Power Query May 26, 2019August 31, 2023 2 Minutes

As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.

My original attempt to write an M query to convert a folder full of images to text looked something like this:

let
Source =
Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
SplitText =
(ImageBinaryText as text) =>
let
SplitTextFunction =
Splitter.SplitTextByRepeatedLengths(30000),
SplitUpText =
SplitTextFunction(ImageBinaryText)
in
SplitUpText,
#"Added Custom" =
Table.AddColumn(
Source,
"Pic1",
each SplitText(
Binary.ToText(
[Content],
BinaryEncoding.Base64
)
)
),
#"Expanded Pic" =
Table.ExpandListColumn(
#"Added Custom",
"Pic1"
)
in
#"Expanded Pic"

The approach I took was the one that seemed natural to me at the time:

Use the Folder data source to connect to the folder containing the image files
Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
Call the function once per row on the table returned by step (1)
Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:

Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned although I admit I didn’t confirm this.

I can’t say I knew what I was doing but I rewrote the query from scratch and came up with the code that I gave in the blog post which only reads from each file once (without using buffering too, I should point out). What’s the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is something we should avoid doing? More research is necessary, and, as always, I would be interested to hear about your experiences – it is after all a fairly common pattern.

Published by Chris Webb

My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel. View all posts by Chris Webb

Published May 26, 2019August 31, 2023

6 thoughts on “Power Query Performance And Expanding Columns”

Rafael Moreno

Rafael Moreno says:

May 27, 2019 at 12:57 pm

Hi
I haven’t been through exactly this situation, but one that is always a nightmate to me is when I need to run a “CountIf” at my query. I’ve got GigaBytes count in a 5 mb data source. Will keep an eye on it as well, I really need to work around it other than loading the query (before the countif) in excel, using excel to perform the countif, and then loading it back again to the model with the countif.

Reply
Jay Sumners says:

May 29, 2019 at 3:13 am

Hi Chris,

I’ve run into this often and have a theory. Lists undergo lazy evaluation and it appears that a list column adheres to this principle. It isn’t a column of lists, but rather a column of pointer code that tells Power Query how to go about getting the list when called. So when you expand the column, it can end up doing some odd things (depending on what generated the list).

Reply
Pingback: Power Query List Expansion Problems – Curated SQL
Triparna Ray – Passionate Business Intelligence Expert offering vast experience leveraging software engineering and agile methodologies to deliver highly effective and creative solutions to business and technology challenges. Utilizes highly attuned analytical skills to develop BI solutions employing cutting-edge technologies to increase productivity. Consistently drives high standards of service through effective project management, communication, and strategic planning to develop and manage strong client relationships. Highly organized with strong capacity to prioritize workload, delegate deliverables, and steer project completion within established deadlines.

Triparna Ray says:

March 19, 2020 at 1:32 pm

Chris, read through your blog(s) on factors that go in determining the performance of a power bi report like browser, client h/w, network latency etc. you have mentioned that you omitted data volumes as it has nothing to do with report loading time on browser. So does it mean that if we have 1 k records or 1 mil records in a import mode report using shared dataset, report visuals will take same time to appear? and it will be same for import mode reports without shared dataset? Also what about the environment – is it same in dev, qa and prod as its on power bi service (cloud) ?

Reply
1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
  
  Chris Webb says:
  
  March 19, 2020 at 9:55 pm
  
  Query performance rather than data volume is very important in determining report performance, but they are not always linked.
  
  Reply
  1. Triparna Ray – Passionate Business Intelligence Expert offering vast experience leveraging software engineering and agile methodologies to deliver highly effective and creative solutions to business and technology challenges. Utilizes highly attuned analytical skills to develop BI solutions employing cutting-edge technologies to increase productivity. Consistently drives high standards of service through effective project management, communication, and strategic planning to develop and manage strong client relationships. Highly organized with strong capacity to prioritize workload, delegate deliverables, and steer project completion within established deadlines.
    
    Triparna Ray says:
    
    March 20, 2020 at 6:08 am
    
    Hi Chris, Thanks for your comments. Agree, query performance plays a bigger role in this aspect and there are endless ways to improve a report performance. I guess though we have the data handling limitations (ceilings) defined for power bi reports we don’t have figures to backup volume vs performance impact as such. But still its okay to factor it in though not the major one as you suggested.

Share this:

Published by Chris Webb

6 thoughts on “Power Query Performance And Expanding Columns”

Leave a ReplyCancel reply

Discover more from Chris Webb's BI Blog