Power Query Performance And Expanding Columns

As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.

My original attempt to write an M query to convert a folder full of images to text looked something like this:

let
Source =
Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
SplitText =
(ImageBinaryText as text) =>
let
SplitTextFunction =
Splitter.SplitTextByRepeatedLengths(30000),
SplitUpText =
SplitTextFunction(ImageBinaryText)
in
SplitUpText,
#"Added Custom" =
Table.AddColumn(
Source,
"Pic1",
each SplitText(
Binary.ToText(
[Content],
BinaryEncoding.Base64
)
)
),
#"Expanded Pic" =
Table.ExpandListColumn(
#"Added Custom",
"Pic1"
)
in
#"Expanded Pic"

The approach I took was the one that seemed natural to me at the time:

  1. Use the Folder data source to connect to the folder containing the image files
  2. Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
  3. Call the function once per row on the table returned by step (1)
  4. Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:

Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned although I admit I didn’t confirm this.

I can’t say I knew what I was doing but I rewrote the query from scratch and came up with the code that I gave in the blog post which only reads from each file once (without using buffering too, I should point out). What’s the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is something we should avoid doing? More research is necessary, and, as always, I would be interested to hear about your experiences – it is after all a fairly common pattern.

6 thoughts on “Power Query Performance And Expanding Columns

  1. Hi
    I haven’t been through exactly this situation, but one that is always a nightmate to me is when I need to run a “CountIf” at my query. I’ve got GigaBytes count in a 5 mb data source. Will keep an eye on it as well, I really need to work around it other than loading the query (before the countif) in excel, using excel to perform the countif, and then loading it back again to the model with the countif.

  2. Hi Chris,

    I’ve run into this often and have a theory. Lists undergo lazy evaluation and it appears that a list column adheres to this principle. It isn’t a column of lists, but rather a column of pointer code that tells Power Query how to go about getting the list when called. So when you expand the column, it can end up doing some odd things (depending on what generated the list).

  3. Triparna Ray – Passionate Business Intelligence Expert offering vast experience  leveraging software engineering and agile methodologies to deliver  highly effective and creative solutions to business and technology  challenges. Utilizes highly attuned analytical skills to develop BI solutions  employing cutting-edge technologies to increase productivity.  Consistently drives high standards of service through effective project  management, communication, and strategic planning to develop and  manage strong client relationships. Highly organized with strong capacity  to prioritize workload, delegate deliverables, and steer project  completion within established deadlines. 
    Triparna Ray says:

    Chris, read through your blog(s) on factors that go in determining the performance of a power bi report like browser, client h/w, network latency etc. you have mentioned that you omitted data volumes as it has nothing to do with report loading time on browser. So does it mean that if we have 1 k records or 1 mil records in a import mode report using shared dataset, report visuals will take same time to appear? and it will be same for import mode reports without shared dataset? Also what about the environment – is it same in dev, qa and prod as its on power bi service (cloud) ?

    1. Chris Webb – My name is Chris Webb, and I work on the Fabric CAT team at Microsoft. I blog about Power BI, Power Query, SQL Server Analysis Services, Azure Analysis Services and Excel.
      Chris Webb says:

      Query performance rather than data volume is very important in determining report performance, but they are not always linked.

      1. Triparna Ray – Passionate Business Intelligence Expert offering vast experience  leveraging software engineering and agile methodologies to deliver  highly effective and creative solutions to business and technology  challenges. Utilizes highly attuned analytical skills to develop BI solutions  employing cutting-edge technologies to increase productivity.  Consistently drives high standards of service through effective project  management, communication, and strategic planning to develop and  manage strong client relationships. Highly organized with strong capacity  to prioritize workload, delegate deliverables, and steer project  completion within established deadlines. 
        Triparna Ray says:

        Hi Chris, Thanks for your comments. Agree, query performance plays a bigger role in this aspect and there are endless ways to improve a report performance. I guess though we have the data handling limitations (ceilings) defined for power bi reports we don’t have figures to backup volume vs performance impact as such. But still its okay to factor it in though not the major one as you suggested.

Leave a ReplyCancel reply