Speed Up Data Refresh Performance In Power BI Desktop Using Table.View

It can sometimes be frustrating to work with slow data sources or complex Power Query queries in Power BI Desktop: you open the Power Query Editor, make some changes, click Close & Apply and then wait a loooong time for your data to refresh. In this post I’m going to show you a technique that can cut this wait by up to 50%. It involves some fairly complex M code but I promise you, the effort is worth it!

In fact, what I’m going to describe is more or less what I showed towards the end of my appearance on Guy In A Cube last year and in a few other posts, but at that time I didn’t understand properly why it worked or what the performance implications actually were. Now, thanks to a lot of help from Curt Hagenlocher of the Power Query development team I have all the details I need to blog about it.

Let’s see a simple example. Say you have a large JSON file – for this test I generated one that is 67MB containing random data using this handy online tool – and you want to load it into Power BI. You’ll end up with a query that looks something like this:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  )
in
  #"Changed Type"

Here’s what the output of this query looks like:

Output

On my laptop this query takes around 7-8 seconds to load. Using Process Monitor to see how much data is being read from the file (see this post for more details) shows that Power Query is reading the data from this file twice. This graph, generated using data from Process Monitor, has time on the x axis and amount of data read from the file on the y axis, and the two peaks indicate that the data is being read twice:

FirstLoad

Why? When you refresh a table in Power BI Desktop two things happen:

  • First, Power BI has to check what columns are present in the table. To do this, it runs the query for the table but asks the Power Query engine to filter the table so it returns zero rows.
  • Secondly the query is run again but this time with no filter, so all rows are returned, and this is when the data is actually loaded into the table in Power BI.

It’s on this first run of the query where problems can occur. If you’re using a data source like SQL Server and you’re not doing any complex transformations then query folding will take place, so Power BI’s request to return the table but with no rows can be handled very efficiently. However, if you’re using a data source where query folding is not possible (such as Excel, CSV or JSON files) or you have complex transformations that stop folding taking place then the only way Power Query can work out what columns the query returns is by running the entire query. That’s what is happening in the example above.

This only happens in Power BI Desktop, though. When you refresh a dataset in the Power BI Service the query is only run once.

The solution is to trick the Power Query engine into running these zero-row filter queries instantly, and you can do that using the Table.View M function. I blogged about Table.View here (and I strongly suggest you read that post before carrying on) but that example didn’t quite get to the solution you need here. Here’s a new version of the query above with an extra step:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  ),
  OverrideZeroRowFilter = Table.View(
    null, 
    [
    GetType = () => 
      type table[
      _id = Text.Type, 
      index = Int64.Type, 
      guid = Text.Type, 
      isAction = Text.Type, 
      balance = Text.Type
    ], 
    GetRows = () => 
      #"Changed Type", 
    OnTake = (count as number) => 
      if count = 0 then 
      #table(
      type table[
        _id = Text.Type, 
        index = Int64.Type, 
        guid = Text.Type, 
        isAction = Text.Type, 
        balance = Text.Type
      ], 
      {}
    ) 
    else 
     Table.FirstN(#"Changed Type", count)]
  )
in
  OverrideZeroRowFilter

The OverrideZeroRowFilter step is where the magic happens. It uses Table.View to override query folding behaviour by intercepting what happens when the table returned by the query is filtered. There are three fields in the record in the second parameter of Table.View that you need to change:

  • GetType returns a table type that describes the columns and their data types present in the output of the query. The six columns listed here are the six columns you can see in the screenshot of the query output above. It’s very easy to generate the required M code when you use the custom function that I blogged about here.
  • GetRows returns all the rows that the query can return, that’s to say the table returned by the #”Changed Type” step.
  • OnTake is used when a top n filter is applied to the table returned by the query.  In this case the code looks at the number of rows that are being requested (specified in the count parameter) and if that’s zero, it just returns an empty table with the same columns as the #”Changed Type” step; if it’s more than zero then it uses Table.FirstN to get the actual number of rows requested from #”Changed Type”. This means that when Power BI does that zero-row filter it can now happen immediately because there’s no need to go back to the data source or execute any of the transformations in the query.

This version of the query now runs in 4-5 seconds, and Process Monitor now shows that the JSON file is only read once and, obviously, reading the data once is a lot faster than reading it twice:

SecondLoad

Look at how the graph flattens at the end… but that’s something for another blog post.

There is a downside to this approach: you have to hard-code the table schema that you expect your query to return, and if you change your query to return different columns you’ll have to update the table type in the last step.

I’ve used this technique on other slow queries and it has made a massive difference to the development experience in Power BI Desktop – one query that was taking five minutes to refresh when I closed the Power Query Editor went down to two and a half minutes. If you try this out yourself please let me know if it makes a difference by leaving a comment.

[Don’t forget that there are other things you can do that will also speed up the development experience for slow queries: you should definitely turn off the “Allow data preview to download in the background” option, and you might want to consider turning off data privacy checks so long as you fully understand what the implications]

 

18 responses

  1. Hi Chris, thanks for sharing such insights, helps a lot with optimization of queries.
    A few questions, if you allow.
    1) When we query a folder with JSON or CSV files, do you suggest to use this trick with Table.View in the function used for each file transformation or in the main query where info if combined? I believe should be in the function, where we read binary.
    2) Do you think this trick will help to optimize queries in Excel where source is a range/table (Excel.CurrentWorkbook() is used)
    3) Where to place this trick if we have a chain of queries? Query1 (to the source) -> Q2 referencing Q1 -> Q3 is based on results of Q2 (subset). Always in Q1?
    Thanks in advance,
    Ivan

    • 1) and 3) If you’re going to implement this you should always do it in the last step of a query that is being loaded into the dataset, not earlier
      2) I don’t know, but I suspect it will help with reading data from Excel. If you can test it please let me know what you find!

  2. Very interesting. However, I would be concerned if this reached production, as the hard coded schema could cause problems if the underlying JSON changed in an innocuous way that would otherwise still work (new info field, etc). Could a “Dev Mode” be implemented, perhaps via a parameter, which used the above method when true, but defaulted to the normal behaviour when false? It could be changed to false for the beta testing phase onward, changed to true during bug fixing, subsequent change requests, etc..

    • >>Could a “Dev Mode” be implemented
      That’s exactly what I would like… but I’m not the one who makes the decisions on which features to implement 😉

  3. What are the performance counters you used to run this test?
    I tried to run the same test but I didn’t see any performance difference. Maybe I have a too fast computer and I should use a bigger dataset, but the amount of data transferred in the dialog box during the process seems the same.

    Another interesting side effect of this feature is to make sure that an empty file would not break the transformation. This is very important reading files from a directory.

    Suggestion: generating such a transformation should be available in an advanced task of Power Query, because doing this work manually is really time consuming and error-prone!

    However, while trying to apply the same optimization on a real transformation, I discovered that accessing to AzureDataLake source by specifying the root URL. I used it in a transformation transformed in a function and the SampleFile was pointing to the root URL (it is a ADLS Gen2) with tens of thousands of files. The second step was a filter of a single sample file, but I found that the download of the list of all the files was taking 15 seconds for any evaluation/preview/refresh. By using the URL to a folder with a small number of files I’ve removed these 15 seconds!

    • I was using a combination of a stopwatch and the timings for reading data from Process Monitor (see the post referenced above from a couple of years ago). I believe there are other optimisations that can kick in to make the zero-row read much faster even on non-foldable data sources which may explain what you’re seeing – are you using a csv file maybe? I have a suspicion I know what might be happening but I need to do more research. I deliberately chose a JSON file because I know it has to be completely read into memory before it can be loaded, which will be the subject of another blog post soon.

      That tip about accessing the root vs individual files is very interesting. Do you have “Allow data preview to download in the background” turned off?

  4. Hi Chris!
    I just tried this approach in a model connected to the SQL server which had some complex transformations and it was taking really long to load. In fact, I applied all the optimizations you suggested here (turning off “Allow data preview to download in the background” and “Data privacy checks”).
    I can tell you it was a massive reduction of loading time! I need to measure the time it was taken before and how long it is taking now. But I can risk saying that it was around 60% faster than the original time with all these optimizations.
    That was a really great tip!

  5. Dear Chris
    Thank you for your very helpful post as follow-up to a similar post in 2018.

    The logic of Table.View helps presumably a lot in the context of Excel PowerQuery in combination with PowerPivot – I didn’t test yet, but I guess in such environments, usually the source is read twice and the caching is quite poor compared to PowerBI.

    Isn’t there a possibility to wrap your logic into a custom function which takes a table as single input parameter and returns a table? The function itself would evaluate the input table structure (attributes and datatypes) first to create/execute desired Table.View statement.

    Actually it would be great, when this kind of optimization would become part of the Power Query core logic, wouldn’t it ?

  6. Pingback: Chris Webb's BI Blog: Monitoring Power Query Memory Usage With Query Diagnostics In Power BI Chris Webb's BI Blog

  7. Pingback: Power BI in Brief – May 2020 - Microsoft Dynamics NAV Community

  8. I need to run two SQL queries ie first call a stored procedure and then a select statement to display the data. The stored procedure takes a while to run so I want to have it run once. I have tried multiple ways but the M query always calls the Stored Procedure twice. I am using a ODBC connection to run the SQLs. If I call only the stored procedure then it runs once. The problem occurs if I call the select statement after it.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: