Power Query Performance And Expanding Columns

As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.

My original attempt to write an M query to convert a folder full of images to text looked something like this:

let
    Source = 
        Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    SplitText = 
        (ImageBinaryText as text) => 
        let
            SplitTextFunction = 
                Splitter.SplitTextByRepeatedLengths(30000),
            SplitUpText = 
                SplitTextFunction(ImageBinaryText)
        in
            SplitUpText,
    #"Added Custom" = 
        Table.AddColumn(
            Source, 
            "Pic1", 
            each SplitText(
                Binary.ToText(
                    [Content], 
                    BinaryEncoding.Base64
                    )
                )
            ),
    #"Expanded Pic" = 
        Table.ExpandListColumn(
            #"Added Custom", 
            "Pic1"
            )
in
    #"Expanded Pic"

The approach I took was the one that seemed natural to me at the time:

  1. Use the Folder data source to connect to the folder containing the image files
  2. Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
  3. Call the function once per row on the table returned by step (1)
  4. Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:

image

Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned although I admit I didn’t confirm this.

I can’t say I knew what I was doing but I rewrote the query from scratch and came up with the code that I gave in the blog post which only reads from each file once (without using buffering too, I should point out). What’s the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is something we should avoid doing? More research is necessary, and, as always, I would be interested to hear about your experiences – it is after all a fairly common pattern.

6 responses

  1. Hi
    I haven’t been through exactly this situation, but one that is always a nightmate to me is when I need to run a “CountIf” at my query. I’ve got GigaBytes count in a 5 mb data source. Will keep an eye on it as well, I really need to work around it other than loading the query (before the countif) in excel, using excel to perform the countif, and then loading it back again to the model with the countif.

  2. Hi Chris,

    I’ve run into this often and have a theory. Lists undergo lazy evaluation and it appears that a list column adheres to this principle. It isn’t a column of lists, but rather a column of pointer code that tells Power Query how to go about getting the list when called. So when you expand the column, it can end up doing some odd things (depending on what generated the list).

  3. Pingback: Power Query List Expansion Problems – Curated SQL

  4. Chris, read through your blog(s) on factors that go in determining the performance of a power bi report like browser, client h/w, network latency etc. you have mentioned that you omitted data volumes as it has nothing to do with report loading time on browser. So does it mean that if we have 1 k records or 1 mil records in a import mode report using shared dataset, report visuals will take same time to appear? and it will be same for import mode reports without shared dataset? Also what about the environment – is it same in dev, qa and prod as its on power bi service (cloud) ?

      • Hi Chris, Thanks for your comments. Agree, query performance plays a bigger role in this aspect and there are endless ways to improve a report performance. I guess though we have the data handling limitations (ceilings) defined for power bi reports we don’t have figures to backup volume vs performance impact as such. But still its okay to factor it in though not the major one as you suggested.

Leave a Reply to Jay Sumners Cancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: