Speed Up Dataflow Publishing/Validation Times In Power BI And Fabric

If you’re working with slow data sources in Power BI/Fabric dataflows then you’re probably aware that validating them (for Gen1 dataflows) or publishing them (for Gen2 dataflows) can sometimes take a long time. If your data sources are very slow then you may run into the 10 minute timeout on validation/publishing that is documented here. For a Gen1 dataflow you’ll see the following error message if you try to save your dataflow and validation takes more than 10 minutes:

Failed to analyze issues in the query

For a Gen2 Dataflow, where you can save the Dataflow and publishing takes place in the background, you’ll see the following error in your workspace:

Dataflow publish failed

Apart from tuning your data source and tuning your queries, what can you do about this? Well, one of the things that happens when you publish a dataflow is that it works out the columns returned, and the data types of those columns, for all of the queries in the dataflow. It does this by applying a top 0 row filter to each query and running it until it returns data; if you can make that faster then validation/publishing will be faster. Obviously query folding is important here because that top 0 filter should fold, as are more obscure, source-specific settings like this one for ODBC sources. However, there is another trick you can use if you are happy writing some moderately complicated M code – the trick I blogged about here for making Power Query in Power BI Desktop faster.
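
To picture what validation/publishing does, the engine effectively evaluates something like the following against each query. This is a conceptual sketch only, not the engine’s actual code, and YourQuery is a placeholder for any query in the dataflow:

let
  // Conceptual sketch: validation applies a top 0 row filter to the
  // query so it can discover column names and data types without
  // reading any rows. If this filter folds back to the source the
  // probe is fast; if not, the whole query may have to run first.
  ZeroRowProbe = Table.FirstN(YourQuery, 0),
  // the schema can then be read from the resulting empty table
  Schema = Table.Schema(ZeroRowProbe)
in
  Schema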

Let’s see an example with Dataflows Gen2. Consider the following M code, which returns a table with three columns and is deliberately written to take 11 minutes and 1 second to return (see this post for more details on how to create artificially slow Power Query queries).

let
  // Return a two-row table, but only after an artificial delay of
  // 11 minutes and 1 second
  Source = Function.InvokeAfter(
    () =>
      #table(
        type table [
          #"Number Column" = number,
          #"Text Column" = text,
          #"Date Column" = date
        ],
        {
          {1, "Hello", #date(2016, 1, 1)},
          {2, "World", #date(2017, 12, 12)}
        }
      ),
    #duration(0, 0, 11, 1)
  )
in
  Source

As you would expect, trying to publish a Gen1 or Gen2 dataflow that uses this query will fail because it takes more than 10 minutes to return any rows. However, in this case – as in most cases – you know what columns the query returns, so it’s possible to use the Table.View M function to intercept the zero-row filter applied during validation/publishing and return a table with no rows and the same columns that the query above returns. You can do this by adding two extra steps to the M code, like so:

let
  // The same deliberately slow query as before
  Source = Function.InvokeAfter(
    () =>
      #table(
        type table [
          #"Number Column" = number,
          #"Text Column" = text,
          #"Date Column" = date
        ],
        {
          {1, "Hello", #date(2016, 1, 1)},
          {2, "World", #date(2017, 12, 12)}
        }
      ),
    #duration(0, 0, 11, 1)
  ),
  // The type of the table the query returns: alter this to match the
  // columns and data types of your own query
  TableTypeToReturn =
    type table [
      #"Number Column" = number,
      #"Text Column" = text,
      #"Date Column" = date
    ],
  // Intercept the top 0 filter applied during validation/publishing
  // and return an empty table of the right type immediately
  OverrideZeroRowFilter = Table.View(
    null,
    [
      GetType = () => TableTypeToReturn,
      GetRows = () => Source,
      OnTake = (count as number) =>
        if count = 0 then
          #table(TableTypeToReturn, {})
        else
          Table.FirstN(Source, count)
    ]
  )
in
  OverrideZeroRowFilter

The first step added here, called TableTypeToReturn, defines the columns and data types of the table returned by the query; if you use this technique yourself you will need to alter it so it matches the columns and data types of your query. You can read more about #table and table types here, and I have a function that will automatically generate this code from an existing query for you here. The second step, called OverrideZeroRowFilter, looks for situations where a Top N filter is being applied and, if N=0, returns a zero-row table of the type defined in the previous step. For a more detailed explanation see the original blog post.
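
If you don’t want to hand-write the type for a wide table, here’s a rough sketch of how this kind of code generation could work using Table.Schema. To be clear, this is just an illustration of the idea, not the function linked above, and it still has to evaluate the query’s schema once:

let
  // Illustrative sketch only: build the "type table [...]" code for
  // an existing query from its schema
  GenerateTableTypeCode = (SourceTable as table) as text =>
    let
      // Table.Schema returns one row per column, with the column name
      // in [Name] and the type kind ("number", "text", "date" etc.)
      // in [Kind]
      Schema = Table.Schema(SourceTable),
      // build one #"Column Name" = kind line per column
      ColumnLines = Table.TransformRows(
        Schema,
        each "  #""" & [Name] & """ = " & [Kind]
      ),
      Code = "type table [#(lf)"
        & Text.Combine(ColumnLines, ",#(lf)")
        & "#(lf)]"
    in
      Code
in
  GenerateTableTypeCode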

This new version of the query validates/publishes immediately, although it still takes 11 minutes and 1 second to refresh. Of course, if you use this technique and then change your query so that different columns or data types are returned, you have to update the extra code every time, which can be fiddly. But if you’re running into the timeout you don’t have any choice, and even if validation/publishing is merely slow it’s probably worth the extra effort.

13 thoughts on “Speed Up Dataflow Publishing/Validation Times In Power BI And Fabric”

  1. Hi Chris, I had this problem last week and solved it by adding a parameter that pulled 10 rows from the source if it was set to Yes and all rows if it was set to No. The dataflow then validated in a few seconds rather than timing out.

    I then exported the dataflow JSON file, edited it to change the parameter to No and imported it. I deleted the original, tidied things up, then told it to refresh.

    Not sure if your method is easier or harder, but there definitely needs to be an option to force the save without validation.
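
    A minimal sketch of what this parameter trick might look like in M – TakeSample and SlowSource are hypothetical names standing in for the real parameter and source query:

    let
      // SlowSource stands in for the real slow source query and
      // TakeSample for a text parameter with the values "Yes"/"No"
      Source = SlowSource,
      Result =
        if TakeSample = "Yes" then
          // validation only has to look at 10 rows, so it finishes
          // in seconds
          Table.FirstN(Source, 10)
        else
          // the full load for real refreshes
          Source
    in
      Result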

    1. I use a similar approach, but instead of exporting and editing the JSON, I pass the parameter value from another primary dataflow. Whenever amendments are needed, I set the parameter value to 10 (which corresponds to “Yes” in your approach) to make the necessary changes, save, and test run. After that, I change the parameter value in the primary dataflow back to 100 (which corresponds to “No” in your approach) and refresh both dashboards.

      Regards,
      SM

  2. Hi Chris,
    This is great. I’ve had this problem and solved it in a similar way to the comment above.
    If the dataflow in question produces a table with a lot of columns, to make this work it might be necessary to:
    – limit the number of rows,
    – publish,
    – run the code to determine the table type,
    – update the dataflow to include the calculated table type and the Table.View override, and remove the row limit,
    – publish again.
    Otherwise it would be necessary to code the table type manually, which might be error-prone if there are a good number of columns.

    Which leads me on to some questions … what happens if the result of the Dataflow doesn’t end up matching the table type? For example, if I update the code and forget to update the type? I’m guessing that the service is using the result to define a table in the background that it uses to store the results of the Dataflow. So will it break altogether with an error, or just work for the columns that match? Does this mean that in principle, it’s not possible or advisable to create a Dataflow that produces a table with a dynamic set of columns?

    Regards,
    Another Chris.

  3. This is an interesting piece of information and a new learning for me. Thanks for the post. I usually try to solve the timeout issue by creating a view over the query with a limited number of rows (1), saving the flow and then changing the view’s limit. However, that only works with sources that support views; your solution now lets me apply the trick to other sources. But, as mentioned in the other comments, it can be a headache if the flow contains several tables with many columns, and in the future, when the source changes and columns are added or removed, you need to be patient, review the code again and remember what was done in the past. I hope MS changes its strategy in the future and finds a better way of getting table schemas than a schema query that can time out…
    It’s a nightmare that MS asks users to solve the issues caused by its lazy, simplified solutions, throwing the implementation of workarounds onto users…

    Anyway, I tried the approach on a very slow query that returns millions of rows. I used two versions of the source query: the full query for GetRows, as in your sample, and a query hard-limited to 100 rows for the OnTake part instead of the Table.FirstN function (a simplified preview table that excludes any subsequent unnecessary table manipulations done in Source). This was a huge improvement to the user experience of writing the M code in Power BI Desktop, I must admit.
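
    A rough sketch of that two-query variation, slotting into the example from the post above – FullSlowQuery and Limited100RowQuery are hypothetical names for the full source query and a copy hard-limited to 100 rows at the source:

    // This step would replace OverrideZeroRowFilter in the example
    // from the post above
    OverrideZeroRowFilter = Table.View(
      null,
      [
        GetType = () => TableTypeToReturn,
        // real refreshes still read the full, slow query
        GetRows = () => FullSlowQuery,
        OnTake = (count as number) =>
          if count = 0 then
            // validation/publishing gets an empty table immediately
            #table(TableTypeToReturn, {})
          else
            // previews read from the cheap limited query instead of
            // filtering the slow full query
            Table.FirstN(Limited100RowQuery, count)
      ]
    )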

  4. I am running a complex query and I don’t know the data type of each column of the output. Can I fake this part (put the text data type for all columns)?

    TableTypeToReturn =
      type table
      [
        #"Column 1" = text,
        #"Column 2" = text,
        #"Column 3" = text
      ],

    1. It works so long as the filter folds. If query folding doesn’t take place (the position of the step is irrelevant) then you’re going to get slow performance because all the data will be downloaded and filtered inside Power Query.

      1. A curious side effect (which differs from other functions that accept types, like Table.ExpandTableColumn) is that it swaps values if TableTypeToReturn is not in the exact column order of the prior step. E.g. for
        Source{0}[Number Column] = 1
        Source{0}[Date Column] = #date(2016,1,1)

        if you specified
        TableTypeToReturn = type table [
          #"Date Column" = date,
          #"Text Column" = text,
          #"Number Column" = number
        ]
        then
        Source{0}[Date Column] = 1
        Source{0}[Number Column] = #date(2016,1,1)
