As a footnote to my previous post on storing large images in Power BI, I thought all you M nerds out there would be interested to read about a strange performance quirk I noticed while writing the code for that post.
My original attempt to write an M query to convert a folder full of images to text looked something like this:
let Source = Folder.Files("C:\Users\Chris\Documents\PQ Pics"), SplitText = (ImageBinaryText as text) => let SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000), SplitUpText = SplitTextFunction(ImageBinaryText) in SplitUpText, #"Added Custom" = Table.AddColumn( Source, "Pic1", each SplitText( Binary.ToText( [Content], BinaryEncoding.Base64 ) ) ), #"Expanded Pic" = Table.ExpandListColumn( #"Added Custom", "Pic1" ) in #"Expanded Pic"
The approach I took was the one that seemed natural to me at the time:
- Use the Folder data source to connect to the folder containing the image files
- Define a function called SplitText that takes a long piece of text and splits it up into a list of text values no longer than 30000 characters
- Call the function once per row on the table returned by step (1)
- Use the Expand/Aggregate button to expand the new column created by step (3) and get a table with one row for each of the split-up text values
When I ran this query, though, I caught sight of something that is every Power Query developer’s worst nightmare:
Power Query had read 118MB of data from a file that is only 1.6MB: the old problem of multiple file reads. Using Process Monitor (as I describe here) confirmed it. I suspect the query was reading the whole file once for each of the split sections of text the function returned although I admit I didn’t confirm this.
I can’t say I knew what I was doing but I rewrote the query from scratch and came up with the code that I gave in the blog post which only reads from each file once (without using buffering too, I should point out). What’s the difference? I guess it must be the pattern of calling the function once per row in a table and then expanding using Table.ExpandListColumn that was to blame; I tried returning a table from the function instead of a list and the same thing happened. Maybe this is something we should avoid doing? More research is necessary, and, as always, I would be interested to hear about your experiences – it is after all a fairly common pattern.