Power Query Performance And Expanding Columns

As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.

My original attempt to write an M query to convert a folder full of images to text looked something like this:

    Source = 
        Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    SplitText = 
        (ImageBinaryText as text) => 
            SplitTextFunction = 
            SplitUpText = 
    #"Added Custom" = 
            each SplitText(
    #"Expanded Pic" = 
            #"Added Custom", 
    #"Expanded Pic"

The approach I took was the one that seemed natural to me at the time:

  1. Use the Folder data source to connect to the folder containing the image files
  2. Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
  3. Call the function once per row on the table returned by step (1)
  4. Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values

When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:


Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned although I admit I didn’t confirm this.

I can’t say I knew what I was doing but I rewrote the query from scratch and came up with the code that I gave in the blog post which only reads from each file once (without using buffering too, I should point out). What’s the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is something we should avoid doing? More research is necessary, and, as always, I would be interested to hear about your experiences – it is after all a fairly common pattern.

3 responses

  1. Hi
    I haven’t been through exactly this situation, but one that is always a nightmate to me is when I need to run a “CountIf” at my query. I’ve got GigaBytes count in a 5 mb data source. Will keep an eye on it as well, I really need to work around it other than loading the query (before the countif) in excel, using excel to perform the countif, and then loading it back again to the model with the countif.

  2. Hi Chris,

    I’ve run into this often and have a theory. Lists undergo lazy evaluation and it appears that a list column adheres to this principle. It isn’t a column of lists, but rather a column of pointer code that tells Power Query how to go about getting the list when called. So when you expand the column, it can end up doing some odd things (depending on what generated the list).

  3. Pingback: Power Query List Expansion Problems – Curated SQL

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: