Improving Power Query Calculation Performance With List.Buffer()

I saw an interesting post the other day on the Power Query TechNet forum which showed how the List.Buffer() function could be used to improve calculation performance. This is something I’d seen hinted at in other places so I thought it was worth a bit of investigation.

Consider the following query:

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    //Get first 2000 rows from FactInternetSales
    dbo_FactInternetSales = Table.FirstN(
          Source{[Schema="dbo",Item="FactInternetSales"]}[Data], 
          2000),
    //Remove unwanted columns
    RemoveColumns = Table.SelectColumns(
          dbo_FactInternetSales,
          {"SalesOrderLineNumber", "SalesOrderNumber","SalesAmount"}),
    //Get sorted list of values from SalesAmount column
    RankValues = List.Sort(RemoveColumns[SalesAmount], Order.Descending),
    //Calculate ranks
    AddRankColumn = Table.AddColumn(RemoveColumns , "Rank", 
          each List.PositionOf(RankValues,[SalesAmount])+1)
in
    AddRankColumn

It gets the first 2000 rows from the FactInternetSales table in the Adventure Works DW database, removes most of the columns, and adds a custom column that shows the rank of the current row based on its Sales Amount.


On my laptop it takes around 35 seconds to run this query – pretty slow, in my opinion, given the amount of data in this table.

However, using the List.Buffer() function in the RankValues step like so:

let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    //Get first 2000 rows from FactInternetSales
    dbo_FactInternetSales = Table.FirstN(
          Source{[Schema="dbo",Item="FactInternetSales"]}[Data], 
          2000),
    //Remove unwanted columns
    RemoveColumns = Table.SelectColumns(
          dbo_FactInternetSales,
          {"SalesOrderLineNumber", "SalesOrderNumber","SalesAmount"}),
    //Get sorted list of values from SalesAmount column
    //And buffer them!
    RankValues = List.Buffer(List.Sort(RemoveColumns[SalesAmount], Order.Descending)),
    //Calculate ranks
    AddRankColumn = Table.AddColumn(RemoveColumns , "Rank", 
          each List.PositionOf(RankValues,[SalesAmount])+1)
in
    AddRankColumn

With this change, the query runs in just 2 seconds. List.Buffer() stores the sorted list of values used to calculate the rank in memory, which means it is only evaluated once; in the original query, this step and those before it appear to be evaluated multiple times. Curt Hagenlocher’s comment (on this thread) on what List.Buffer() does for a similar calculation is telling:

The reason for this is that M is both functional and lazy, so unless we buffer the output of List.Select, we’re really just building a query that needs to be evaluated over and over. This is similar to the Enumerable functions in LINQ, if you’re familiar with those.

The Table.Buffer() and Binary.Buffer() functions also exist, and do similar things for tables and binary values.
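For instance, Table.Buffer() can be used in much the same way when a table is going to be referenced repeatedly by later steps. A minimal sketch (the DimProduct table here is just an illustration, not part of the example above):

```m
let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    Dimension = Source{[Schema="dbo",Item="DimProduct"]}[Data],
    //Buffer the whole table in memory so that later steps which
    //reference it don't cause it to be re-evaluated
    //(note: like List.Buffer(), this prevents query folding)
    BufferedDimension = Table.Buffer(Dimension)
in
    BufferedDimension
```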

A few other points to make:

  • This is not necessarily the optimal way to calculate ranks in Power Query – it’s just an example of how List.Buffer() can be used.
In the first query above, query folding is not taking place. If it had been, performance would probably have been much better. Because using List.Buffer() explicitly prevents query folding from taking place, in many cases it could make performance worse rather than better.
  • I’m 100% certain you’ll get much better performance for a rank calculation by loading the table to the Excel Data Model/Power Pivot and writing the calculation in DAX. You should only really do calculations like this in Power Query if they are needed for other transformations in your query.
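On the first point above, one alternative way of calculating ranks in Power Query is to sort the table and add an index column. This is only a sketch; note that, unlike the List.PositionOf() approach, it assigns distinct ranks to tied SalesAmount values:

```m
let
    //Connect to SQL Server
    Source = Sql.Database("localhost", "adventure works dw"),
    //Get first 2000 rows from FactInternetSales
    dbo_FactInternetSales = Table.FirstN(
          Source{[Schema="dbo",Item="FactInternetSales"]}[Data],
          2000),
    //Remove unwanted columns
    RemoveColumns = Table.SelectColumns(
          dbo_FactInternetSales,
          {"SalesOrderLineNumber", "SalesOrderNumber", "SalesAmount"}),
    //Sort descending by SalesAmount, then use a 1-based index as the rank
    SortRows = Table.Sort(RemoveColumns, {{"SalesAmount", Order.Descending}}),
    AddRankColumn = Table.AddIndexColumn(SortRows, "Rank", 1, 1)
in
    AddRankColumn
```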

17 thoughts on “Improving Power Query Calculation Performance With List.Buffer()”

  1. Hm, that’s pretty cool. I’d wonder when exactly to use this though — pity PQ doesn’t really show much of what’s going on under the hood.
    So here you’d conclude you should buffer the result knowing you’re referring to an idempotent List function in an iteration?
    I’d wonder why only List/Table/Binary though — as if “M is both functional and lazy” would apply to those types.

    I suffered an issue of uncontrolled reloading when trying to use Power Query for scraping. Time to see if this’d solve that..
    The Load function I’d written previously to load functions from text files also suffered constant reloading, and may well benefit from Binary.Buffer() as well.

    Then again, with Excel 2016 exposing PQ to VBA, one could also just use VBA to import all PQ functions from some master spreadsheet. Hm, or to batch import queries from different files for that matter… I’m seeing some possibilities here. 🙂

  2. This is a topic that should have been better documented, as buffering lists (and tables) can have a significant positive impact, as you’ve shown here and as shown in Curt’s example. I’ve written a lot of custom functions that take one or more lists as inputs, and buffering these lists before passing them to the functions can improve performance by several orders of magnitude.

    From what I can tell, buffering does two things: 1) it evaluates the list, table, or binary expression eagerly, and 2) it persists the value in memory so that it doesn’t have to be re-evaluated on subsequent calls. I think that stopping query folding on tables is a side effect of buffering, rather than being its main role.

    Thanks for the excellent post!
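    The pattern described in this comment, buffering a list before passing it to a custom function, might look something like the following minimal sketch (the table and the CountSmaller function are hypothetical, purely for illustration):

    ```m
    let
        //Hypothetical source table with an Amount column
        Source = Table.FromRecords({[Amount = 10], [Amount = 20], [Amount = 30]}),
        //A custom function that scans the whole list once per row
        CountSmaller = (values as list, current as number) as number =>
            List.Count(List.Select(values, each _ < current)),
        //Buffer the list once so that each call to CountSmaller reuses
        //the in-memory copy instead of re-evaluating the expression
        BufferedAmounts = List.Buffer(Source[Amount]),
        AddCount = Table.AddColumn(Source, "SmallerCount",
            each CountSmaller(BufferedAmounts, [Amount]))
    in
        AddCount
    ```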

  3. To add a bit to informal documentation here… I tried using a buffer, found an error in my results due to bad input in a cell, then tried refreshing the query. Didn’t work — the buffer kept the bad value in. I then tried saving and reopening the sheet, after which a query refresh did succeed.

    So buffers persist through query refreshes, though not through workbook closes. The latter was expected I guess.

    It makes the question of when to buffer a bit tricky, though: you want to avoid unneeded recalculation, yet still be able to recalculate the parts that need a fix to go through. I suppose it only really becomes a relevant question in bigger data processing systems based on M, which might not be a common use case yet…

    I think this buffering should probably help address a lot of the issues I experienced with Power Query though, so as far as I’m concerned this will help push things toward bigger projects.

    This question of allowing the user more control over buffers is kind of interesting though. Then there’s the thing about persistence (across closes) as well though — it’s actually making me think if having M interface with something like Redis might help for that. Considering all it’d need would be HTTP requests and (de)serialization, that actually seems feasible to me too…

    On a similar note, I suppose using HTTP requests to communicate with a (local) server might actually also enable (indirectly) interfacing with other languages, COM APIs, message queues… I’m barely starting to fathom the extent of possibilities here.

  4. I used one of the buffer commands (I can’t remember which now and it’s late :-)) when doing an OData query to an Excel spreadsheet stored on an Office 365 SharePoint document library. Without the buffer command it was excruciatingly slow (like minutes to execute a few lines of M against about 30 lines of data in the spreadsheet). With the buffer command it executed in seconds.

  5. Hi. Might be interesting. I’m curious as to how I could implement it for what I’m currently doing. Here’s my situation :
    I’m using Power Query to prep and merge fairly large datasets (10M rows, 30 columns, .csv feed). I’m using good hardware with 32GB RAM, yet I’m experiencing some difficulties.

    When loading either to PPVT or a spreadsheet, Excel freezes, takes a humongous amount of time (10h+), or simply won’t finish very simple queries, and the reason eludes me.

    RAM is not jammed at all, CPU is at about 30%, and it simply looks like the thing is not happening.

    Any advice on how to do merges/prep with such files in PQ? (Query folding isn’t an option since the inputs are flat files on my laptop.)

    Thanks !

      • I use CSV files for a lot of feeds into our reporting (doncha just love Sage Line 50 :-)) and it seems to me that you need 64-bit even for fairly small datasets, particularly if you are pulling data from SharePoint Lists as well.

      • Hi Chris,

        I’m using the 64-bit version. I’m a Power Query fan for the GUI (it saves a lot of time and avoids many mistakes), but I’m really astonished at how poorly M performs compared to, say, Pandas (a Python library).

        I did a quick benchmark on the same laptop for a simple task: Merging two 7M rows local .csv files on 4 columns, leaving aside the fact that you can’t write back to .csv in PowerQuery for large files (I hope they implement that soon).

        – Pandas merges and writes back to csv in about 2 min.
        – PowerQuery merges and exports it to PPVT in 20+ min, and you’re often left wondering if it will ever finish (especially when you do a filtering step somewhere).

        1/ Is there something I don’t get about this?
        2/ How come M’s native functions can’t even outperform a low-performance library like Pandas?
        3/ Why can’t M just load both tables in RAM and be done with it? (raising the limit in cache doesn’t seem to do anything)

        I’m still gonna stick with M for dataprep since I make very few mistakes thanks to the GUI, but it performs really poorly for some tasks.

  6. @AC_6_9: To be fair, I don’t think Pandas really counts as low performance, since NumPy just offloads most of the heavy lifting to BLAS, which is meticulously optimized native code.
    More cynically speaking, I’d regard M as a toy language for its lack of performance profiling tools. Realistically, you’re gonna have no clue what the heck is going on under the hood, and to what extent it’s being silly recalculating the same things a bunch of times (which was my experience when inspecting network requests, IIRC not fixed after adding Buffer methods).

    • Hi tycho01,

      All evidence so far points to you being right. I guess I just don’t understand how you can end up with such underwhelming performance after devoting Microsoft-scale resources to developing a language specifically designed for data prep.

      Was this an unavoidable trade-off for the GUI? Or is the language just not mature enough?

      • @AC_6_9: When I used Power Query, I’d gotten to the point of just writing M instead of using the GUI. Which is probably further than most.

        I ended up feeling the language was a huge pain to write manually (having to swap around names and commas each time you add a line or swap two around), and made a comment to that effect to the devs on the official support forum. They don’t care.

        This led me to believe that they themselves don’t take M seriously enough, and just see the GUI as the main attraction here; understandable, since arguably that was the main added value.
        AFAIK SQL Server doesn’t support M yet either, which is probably another sign, though at least Power BI did start using it.

        I’m gonna guess the main thing here is that Microsoft themselves probably do little using M though, which absolves them from really experiencing its pain points themselves. And I believe they did indeed market this whole thing as “self-service BI”.
        In my ears that does kind of translate to “this is not real enough for us to use in our own critical corporate projects, but hey, we’re marketing it to you since you probably can’t hire enough smart people to do real code”.

        I’ll have to concede my personal takeaway is that language design and all of this is freaking hard, and hence they’re sticking with their own niche of “well, it can do a few things through a GUI”. By hard, I mean having to decide when to (re)calculate things (as apparently M feels the need to do), as well as the order of execution with respect to side effects in functional languages.
        I’m not sure if you’re familiar with Haskell, but it’s one of the only other languages that, like M, can decide its own order of execution; unlike M, though, it has also solved the question of how to handle order of execution in cases where it does matter. That learning curve is probably why it hasn’t gained much traction.

        So as to maturity, I’m gonna go ahead and guess things ain’t about to change.
        You’ll wonder what I’m doing on this blog, since I’ve stopped using it. This was just one of a couple of threads I happened to still get reply notifications for.

        In your scenario, for tasks that you’d estimate would take too long to run on M, you’d probably be faster off just writing them out in Python instead.

      • I can’t speak for the Power Query/Power BI dev team, but I thought I’d add some comments here…

        Certainly there are lots of issues with performance that need to be addressed, but as far as I can see they are being worked on. One of the key features of Power Query – query folding, or the ability to translate transformations back into something the data source can understand, like SQL – is getting better with each monthly release. I know this isn’t relevant for text files, but it is relevant for many other scenarios. At the end of the day storing your data in something like a relational database and making M into a universal query generator is going to lead to better overall scalability than any approach which tries to solve performance problems on the desktop.

        As far as the dev environment goes I agree it’s painful. I have heard rumours that there will be improvements in the coming month but as Tycho says, the UI is the main selling point here, not the underlying language. Indeed it’s clear that Microsoft is promoting Python and R as languages for data analysis quite heavily (eg with their integration into Visual Studio, Azure Machine Learning etc) and M isn’t seen in the same way. Why would Microsoft invest in M as a language for data analysis and manipulation when R and Python already exist? M came before the UI, it’s true, and it’s important to have it available for situations where you need to do more complex things than the UI currently allows, but the UI is what allows a much broader range of users to work with data and that mass market is where Microsoft traditionally has most success.
