Delaying Power BI Dataset Refresh Until The Source Data Is Ready

This week a customer came to me with the following problem: they had scheduled the refresh of their dataset but their source data wasn’t always ready in time, so the old data was being loaded by mistake. The best solution here is to use some kind of external service (for example Power Automate) to poll the data source regularly to see if it’s ready, and then to refresh the dataset via the Power BI REST API when it is. However, it got me thinking about a different way of tackling this: is it possible to write some M code that will do the same thing? It turns out that it is, but it’s quite complicated – so I don’t recommend you use the code below in the real world. Nevertheless I wanted to write up the solution I came up with because it’s interesting and you never know, it might be useful one day.

Here’s the scenario I implemented. Using a SQL Server data source, let’s say that you want to load the result of the following SQL query against the AdventureWorksDW2017 database into Power BI:

SELECT DISTINCT [EnglishDayNameOfWeek] FROM DimDate

However, let’s also say that the data in this table may not be ready at the time when dataset refresh is scheduled; when the data is ready, the sole value in the sole column of another table, called ContinueFlagTable, will be changed from False to True:

Therefore when dataset refresh starts we need to check the value from the ContinueFlagTable regularly; if it is False we need to wait for a given amount of time and then check again; if it is True the data can be loaded; if the flag hasn’t changed to True after checking a certain number of times then we need to raise an error.

Here’s the full M code of the Power Query query:

let
  //The SQL Server database
  Source = Sql.Database(
      "MyServerName", 
      "AdventureWorksDW2017"
    ),
  //The query to load into the dataset
  ResultQuery
    = Value.NativeQuery(
        Source, 
        "SELECT DISTINCT [EnglishDayNameOfWeek] FROM DimDate"
      ),
  //The number of times to check the Continue flag
  MaxIterations = 3,
  //The number of seconds to wait before checking the Continue flag
  DelaySeconds = 5,
  //Check the Continue flag
  Iterate = List.Generate(
      () => 1, 
      each (
      //Query the ContinueFlag table
      //after waiting the specified number of seconds
      //and keep trying until either it returns true
      //or we reach the max number of times to check(Function.InvokeAfter(
          () => 
            let
              dbo_ContinueFlag
                = Value.NativeQuery(
                    Source, 
                    "SELECT [ContinueFlagColumn] from ContinueFlagTable where -1<>@param", 
                    [
                      param
                        = _
                    ]
                  ),
              Continue
                = dbo_ContinueFlag{0}[ContinueFlagColumn]
            in
              Continue, 
          #duration(
              0, 
              0, 
              0, 
              DelaySeconds
            )
        )
        = false
      )
        and (_ < MaxIterations)
      ), 
      each _ + 1
    ),
  //Find how many times the ContinueFlag was checked
  NumberOfIterations
    = List.Max(Iterate),
  //Did we reach the max number of checks?
  ReturnError
    = NumberOfIterations
      = MaxIterations
      - 1,
  //Table type of the table to load into the dataset
  ReturnTableType
    = type table [
      EnglishDayNameOfWeek = Text.Type
    ],
  //Error message to return if the
  //max number of checks is reached
  ErrorMessage
    = error "After waiting "
      &amp; Text.From(
        MaxIterations
        * DelaySeconds
      )
      &amp; " seconds your data is still not ready to be loaded",
  //Handle Power BI Desktop's behaviour of asking for
  //the top 0 rows from the table
  //before loading the actual data
  OverrideZeroRowFilter = 
    if ReturnError then 
      ErrorMessage
    else 
      Table.View(
          null, 
          [
            GetType = () => 
              ReturnTableType, 
            GetRows = () => 
              ResultQuery, 
            OnTake
              = (
                count as number
              ) => 
                if count = 0 then 
                  #table(
                      ReturnTableType, 
                      {}
                    )
                else 
                  Table.FirstN(
                      ResultQuery, 
                      count
                    )
          ]
        )
in
  OverrideZeroRowFilter

There’s a lot to explain here:

  • Everything is in a single query, and this is deliberate: I found it was the only way I could make sure that the query to load the data (ie the query in the ResultQuery step) is run only once, after the check on the ContinueFlagTable has returned true. If I split the code to get the data off into a separate query, I found that it got queried before any checks on ContinueFlagTable; this seemed to be something to do with the formula firewall, but I’m not sure.
  • I’ve used List.Generate to do the polling of ContinueFlagTable. There’s a good example of how to use it to implement Do Loops here.
  • I’ve used Function.InvokeAfter to add the delays in between checks on ContinueFlagTable.
  • I found that if I ran the same query to check ContinueFlagTable, after the first execution the result was cached and the query couldn’t tell if the flag changed after that. Therefore I wrote a SQL query with a WHERE clause that was different each time but which always returned true; I did this by passing the iteration number into the query via a parameter, and I passed the parameter in using the Value.NativeQuery function.
  • It’s possible to raise an error when needed using the error keyword, described here.
  • The OverrideZeroRowFilter step at the end uses the technique I blogged about here to deal with the fact that Power BI Desktop asks for the top 0 rows from a table to get the schema before it loads the data.

Refreshing the table in Power BI Desktop with the value True in ContinueFlagTable loads the data successfully; in Profiler you can see one query to ContinueFlagTable and one query to get the data to be loaded from the DimDate table:

Refreshing the table with the value False in ContinueFlagTable shows the specified number of queries to ContinueFlagTable and no query to DimDate:

Here’s the error message you get in Power BI Desktop when this happens:

Here’s the same error in the Power BI Service when you try to refresh the dataset set:

Visualising Azure Analysis Services Processing Tasks With The Job Graph Events Sample

I’ll leave it to other people to get worked up about things like Power BI Premium gen2 – what really gets me excited is a new Profiler event type. And indeed there is a new one to play with – Job Graph – that you can use to visualise Azure Analysis Services processing jobs (it’s not enabled yet for the Analysis Services engine in Power BI).

More details about it, and how it can be used, are in the samples here:

https://github.com/microsoft/Analysis-Services/tree/master/ASJobGraphEvents

The data returned by the Job Graph event isn’t intelligible if you look at the text it returns in Profiler. However if you save a .trc file with Job Graph event data to XML you can use the Python scripts in the GitHub repo to generate DGML diagrams that can be viewed in Visual Studio, plus Gantt charts embedded in HTML. Of course to do this you’ll need to have Python installed; you’ll also need to have Visual Studio and its DGML editor installed (see here for details).

Here’s the DGML diagram for the refresh of a very simple AAS database with just one table in it, called People:

The eight nodes in a semi-circle around the top of the diagram represent the eight jobs that process the attribute hierarchies in the table. These eight jobs must be completed before the single partition in the table is processed (this is the node in the centre of the diagram) and that partition must be processed before the table itself is processed (this is the node at the bottom of the diagram).

Since this is the annotated output you can hover over a node in the diagram and see more details about it, including its duration and how long it was blocked for:

Notice how three nodes have dark backgrounds? These nodes represent the critical path. From the sample documentation:

When the job graph executes, there is always a job that finishes last before the engine can commit the change. This job that finishes last is the “critical dependency” for the commit; the entire commit needs to wait for this one job to finish before executing. This last job depends on other jobs, one of which finished after all the others. This is the next critical dependency. Tracing this path of critical dependencies forms the critical path, which helps engineers and customers identify why processing takes so long.

To diagnose slow refresh times, look for the critical path and start at the top. Look at “Blocked duration”, “Waiting duration”, and “Running duration”. If a job has a long blocked duration, it spent a long time waiting on other jobs. If a job has a long waiting duration, it was waiting for an available thread, so increasing maxParallelism could help. If a job has a long running duration, then the model might have to be changed to speed up the job.

Here’s another diagram, generated from a slightly more complex database with four tables, a calculated column and a calculated table:

There’s also a script that takes this data and generates a Gantt chart from it, but at the time of writing it seems to be a bit buggy; I can make it work if I comment out some code but I suspect that means the output isn’t accurate. Hopefully it will get fixed soon, but in the meantime here’s an example of what it produces:

I think this is going to be really useful when I’m troubleshooting processing performance problems. It would also be really cool to have this in DAX Studio (hint, hint)…

Handling Multi-select In Power BI Dynamic M Parameters

Even though the documentation for dynamic M parameters does mention how to handle multi-select in the M code for your Power Query queries, I thought it would be useful to provide a detailed example of how to do this and explain what happens behind the scenes when you use multi-select.

Once again I’m going to use Azure Data Explorer as my DirectQuery data source in this post. Consider the following simple KQL query, which returns a list of counties in the UK and the average price paid for real estate in them:

pricepaid
| where county in ('DEVON', 'CORNWALL', 'KENT')
| summarize avg(price) by county

The third line of this query uses a combination of the KQL where and in operators to filter the list of counties in a way that’s very similar to other query languages such as SQL. The challenge is to write an M expression that will generate the comma-delimited list of county names in parantheses at the end of this line.

Taking the query above and generating a non-dynamic DirectQuery table in Power BI is straightforward, and let’s assume that you have already created an M parameter called Selected County:

… a table of county names:

…and bound the County column of this table to the SelectedCounty M parameter and turned on the Multi-select option in Model View in the main Power BI Desktop window:

Now the confusing thing for me, when I first started to look at this problem, was that the M parameter called SelectedCounty I created was of type Text but the example code in the documentation was written for an M parameter of type list – when in fact it isn’t possible to create an M parameter of type list in the Power Query Editor (at least not at the time of writing). It turns out that when the Multi-select option is turned on Power BI is able to send a value of type list to the parameter regardless of what type you have defined for it.

With that knowledge here’s an example of an M query to generate the query using the parameter:

let
  CountyList = 
    if 
    //check to see if the parameter is a list
      Type.Is(
        Value.Type(SelectedCounty), 
        List.Type
      ) then 
        //if it is a list
        let
          //add single quotes around each value in the list
          AddSingleQuotes = List.Transform(
              SelectedCounty, 
              each "'" &amp; _ &amp; "'"
            ),
          //then turn it into a comma-delimited list
          DelimitedList = Text.Combine(
              AddSingleQuotes, 
              ","
            )
        in
          DelimitedList
    else 
      //if the parameter isn't a list
      //just add single quotes around the parameter value
      "'" &amp; SelectedCounty &amp; "'",
  //generate and run the KQL query
  Source = AzureDataExplorer.Contents(
      "https://mycluster.northeurope.kusto.windows.net", 
      "pricepaid", 
      "pricepaid#(lf)
       | where county in 
       ("
        &amp; CountyList &amp;
       ")#(lf)
       | summarize avg(price) by county", 
      [
        MaxRows                 = null, 
        MaxSize                 = null, 
        NoTruncate              = null, 
        AdditionalSetStatements = null
      ]
    )
in
  Source

Note that the CountyList step has to check the data type of the parameter using an if statement, because in the Query Editor it will always be Text whereas in the report it will be a list if multi-select is turned on and Text if not.

Here’s the final report showing a multiselect slicer passing values into this query:

[Thanks to Ravi Kiran Vemulapalli and Sujata Narayana for their help with this post]

Geospatial Analysis With Azure Data Explorer, Power BI And Dynamic M Parameters

Since last week’s blog post about dynamic M parameters generated so much interest, this week I thought I’d give you another example of something cool you can do with them when you’re using Azure Data Explorer (ADX) as a DirectQuery source in Power BI: geospatial analysis.

Let’s say you work for a chain of supermarkets and want to use Power BI see what other competing stores are close to one of your stores. First of all you need some data, in in this case my data source was this list of British supermarkets and their locations , published by the nice people at Geolytix, and called Retail Points. Once again I loaded the csv file into ADX and was able to write a KQL query using the geo_point_in_circle() function to filter the list of supermarkets down to those within a specified distance of a given supermarket. For example, here’s a simple KQL query that finds all the supermarkets within a 5000m radius of the location of my local Marks and Spencer store in Amersham:

let centrestore =  
retailpoints
| where store_name == "M&amp;S Amersham SF"
| project long_wgs, lat_wgs;
let centrelat = toscalar( centrestore | summarize max(lat_wgs));
let centrelong = toscalar ( centrestore | summarize max(long_wgs));
retailpoints
| where geo_point_in_circle(long_wgs, lat_wgs, centrelong, centrelat, 5000)

[Apologies if this isn’t the greatest KQL query – I’m still learning]

I was then able to create two M parameters called selectedstore and selectedradius and bind them to two other tables in my dataset using dynamic M parameters so that the end user could use slicers to select:

  • The store they want to use as the central point of their analysis
  • The size in metres of the radius of the circle around the central point within which other stores should be displayed

These M parameters could then be injected into the KQL query of a DirectQuery table using the following M query:

let
  Source = AzureDataExplorer.Contents(
      "https://mycluster.northeurope.kusto.windows.net", 
      "retailpoints", 
      Text.Format(
          "declare query_parameters(filterstore:string = ""#[storeMparametername]"", centreradius:int = #[radiusMparametername]);#(lf)let centrestore =  #(lf)retailpoints#(lf)| where store_name == filterstore#(lf)| project long_wgs, lat_wgs;#(lf)let centrelat = toscalar( centrestore | summarize max(lat_wgs));#(lf)let centrelong = toscalar ( centrestore | summarize max(long_wgs));#(lf)retailpoints#(lf)| where geo_point_in_circle(long_wgs, lat_wgs, centrelong, centrelat, centreradius)", 
          [
            storeMparametername  = selectedstore, 
            radiusMparametername = selectedradius
          ]
        ), 
      [
        MaxRows                 = null, 
        MaxSize                 = null, 
        NoTruncate              = null, 
        AdditionalSetStatements = null
      ]
    )
in
  Source

There are two important things to notice here:

  • I declared KQL query parameters to hold the selected store name and selected radius at the start of the KQL query. As the documentation notes, this is important to guard against KQL injection attacks.
  • I used the M Text.Format function to inject the M parameter values into the KQL query parameter declarations, which was easier and cleaner than generating the KQL query string by concatenating text.

And here’s the report I built, for example showing how you can select an Aldi store in Aberdeen and display all stores within a 1km, 3km and 7km radius of that store:

Not the most sophisticated geospatial report in the world, I admit, and I know very few people are using Azure Data Explorer as a data source for Power BI right now, but I do think it is a fun example of the kind analysis that dynamic M parameters now make possible.

%d bloggers like this: