Handling Multiple URL Query Parameters With The Same Name Using Web.Contents In Power Query/Power BI

Some time ago I wrote a pair of popular posts about using the Query and RelativePath options of the Web.Contents function in Power Query and why they are important for dataset refresh. I have recently learned something extra about this subject which merits a new post, though: how to handle multiple URL query parameters with the same name.

In the examples for this post I’ll be using a free, fake web service available at http://jsonplaceholder.typicode.com/ which requires no authentication so you will be able to run the code I show yourself. Let’s start by considering the following call to the comments resource of this API:

http://jsonplaceholder.typicode.com/comments?postId=1

In M, you can use the Web.Contents function and the Query option to call the API like so:

//Generates the URL http://jsonplaceholder.typicode.com/comments?postId=1
Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId="1"]])

Now this API – unlike many others – allows you to pass multiple URL query parameters of the same name. For example this is a valid call:

http://jsonplaceholder.typicode.com/comments?postId=1&postId=2

The problem is, if you try to do the following in M:

Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId="1",postId="2"]])

…you’ll get an error, because the Query option takes a record and a record can’t have two fields with the same name.

Here’s where it gets interesting. In the working example above, the postId field in the Query record contains a text value. The field can contain a list of text values instead, though, and this is how you solve the problem:

//Generates the URL http://jsonplaceholder.typicode.com/comments?postId=1&postId=2
Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId={"1","2"}]])

This results in a URL with two query parameters that are both called postId but have the values 1 and 2 respectively.

There’s something else worth mentioning about this too. Sometimes when you’re generating a URL you may not want to add a query parameter to it if a value is null. One way of handling this is to start with an empty record and then add fields to it using the Record.AddField function, but using an empty list provides another approach. For example:

//Generates the URL http://jsonplaceholder.typicode.com/comments
//and *not* http://jsonplaceholder.typicode.com/comments?postId=
Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId={}]])

…generates a URL without the postId query parameter in. This means you can write a simple function with an optional parameter for postId like this:

(optional myPostId as text) =>
Json.Document(
    Web.Contents("http://jsonplaceholder.typicode.com/comments",[Query=[postId=myPostId ?? {}]])
    )

If a text value is passed to myPostId then the postId query parameter is added to the URL; if no value is passed, myPostId is null and the ?? null coalescing operator (see Ben Gribaudo’s post here for more information on this operator) can be used to replace it with an empty list.
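
For comparison, here’s a rough sketch of the Record.AddField approach mentioned above – it achieves the same behaviour as the empty-list trick, just with a bit more code:

//A minimal sketch of the Record.AddField approach mentioned above
(optional myPostId as text) =>
let
    //start with an empty record...
    EmptyQuery = [],
    //...and only add the postId field if a value was passed in
    QueryRecord =
        if myPostId = null
        then EmptyQuery
        else Record.AddField(EmptyQuery, "postId", myPostId),
    Result = Json.Document(
        Web.Contents(
            "http://jsonplaceholder.typicode.com/comments",
            [Query = QueryRecord]
        )
    )
in
    Result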

One last thing to mention: if you need to see the web service calls generated by Web.Contents when you’re testing in the Power Query Editor, you can use the query diagnostics functionality as I show here.

[I’m extremely grateful to Dave Maiden for letting me know about all this – it answers a question I have been asked several times in the past]

Handling Multi-select In Power BI Dynamic M Parameters

Even though the documentation for dynamic M parameters does mention how to handle multi-select in the M code for your Power Query queries, I thought it would be useful to provide a detailed example of how to do this and explain what happens behind the scenes when you use multi-select.

Once again I’m going to use Azure Data Explorer as my DirectQuery data source in this post. Consider the following simple KQL query, which returns a list of counties in the UK and the average price paid for real estate in them:

pricepaid
| where county in ('DEVON', 'CORNWALL', 'KENT')
| summarize avg(price) by county

The second line of this query uses a combination of the KQL where and in operators to filter the list of counties in a way that’s very similar to other query languages such as SQL. The challenge is to write an M expression that will generate the comma-delimited list of county names in parentheses at the end of this line.

Taking the query above and generating a non-dynamic DirectQuery table in Power BI is straightforward, so let’s assume that you have already created an M parameter called SelectedCounty:

… a table of county names:

…and bound the County column of this table to the SelectedCounty M parameter and turned on the Multi-select option in Model View in the main Power BI Desktop window:

Now the confusing thing for me, when I first started to look at this problem, was that the SelectedCounty M parameter I had created was of type Text, while the example code in the documentation was written for an M parameter of type list – even though it isn’t possible to create an M parameter of type list in the Power Query Editor (at least not at the time of writing). It turns out that when the Multi-select option is turned on, Power BI sends a value of type list to the parameter regardless of the type you have defined for it.

With that knowledge, here’s an example of an M query that uses the parameter to generate and run the KQL query:

let
  CountyList = 
    if 
    //check to see if the parameter is a list
      Type.Is(
        Value.Type(SelectedCounty), 
        List.Type
      ) then 
        //if it is a list
        let
          //add single quotes around each value in the list
          AddSingleQuotes = List.Transform(
              SelectedCounty, 
              each "'" & _ & "'"
            ),
          //then turn it into a comma-delimited list
          DelimitedList = Text.Combine(
              AddSingleQuotes, 
              ","
            )
        in
          DelimitedList
    else 
      //if the parameter isn't a list
      //just add single quotes around the parameter value
      "'" & SelectedCounty & "'",
  //generate and run the KQL query
  Source = AzureDataExplorer.Contents(
      "https://mycluster.northeurope.kusto.windows.net", 
      "pricepaid", 
      "pricepaid#(lf)
       | where county in 
       ("
        & CountyList &
       ")#(lf)
       | summarize avg(price) by county", 
      [
        MaxRows                 = null, 
        MaxSize                 = null, 
        NoTruncate              = null, 
        AdditionalSetStatements = null
      ]
    )
in
  Source

Note that the CountyList step has to check the data type of the parameter using an if statement, because in the Query Editor it will always be Text whereas in the report it will be a list if multi-select is turned on and Text if not.
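
If you want to test this logic in the Power Query Editor without publishing the report, a quick (hypothetical) way is to paste something like the following into a blank query, then swap the hard-coded list for a single text value to check both branches:

//A hypothetical test query: replace TestValue with a single text value
//such as "KENT" to exercise the non-list branch
let
    TestValue = {"DEVON", "CORNWALL"},
    CountyList =
        if Type.Is(Value.Type(TestValue), List.Type) then
            Text.Combine(
                List.Transform(TestValue, each "'" & _ & "'"),
                ","
            )
        else
            "'" & TestValue & "'"
in
    CountyList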

Here’s the final report showing a multiselect slicer passing values into this query:

[Thanks to Ravi Kiran Vemulapalli and Sujata Narayana for their help with this post]

Geospatial Analysis With Azure Data Explorer, Power BI And Dynamic M Parameters

Since last week’s blog post about dynamic M parameters generated so much interest, this week I thought I’d give you another example of something cool you can do with them when you’re using Azure Data Explorer (ADX) as a DirectQuery source in Power BI: geospatial analysis.

Let’s say you work for a chain of supermarkets and want to use Power BI to see what competing stores are close to one of your stores. First of all you need some data; in this case my data source was this list of British supermarkets and their locations, published by the nice people at Geolytix and called Retail Points. Once again I loaded the csv file into ADX and was able to write a KQL query using the geo_point_in_circle() function to filter the list of supermarkets down to those within a specified distance of a given supermarket. For example, here’s a simple KQL query that finds all the supermarkets within a 5000m radius of the location of my local Marks and Spencer store in Amersham:

let centrestore =  
retailpoints
| where store_name == "M&S Amersham SF"
| project long_wgs, lat_wgs;
let centrelat = toscalar( centrestore | summarize max(lat_wgs));
let centrelong = toscalar ( centrestore | summarize max(long_wgs));
retailpoints
| where geo_point_in_circle(long_wgs, lat_wgs, centrelong, centrelat, 5000)

[Apologies if this isn’t the greatest KQL query – I’m still learning]

I was then able to create two M parameters called selectedstore and selectedradius and bind them to two other tables in my dataset using dynamic M parameters so that the end user could use slicers to select:

  • The store they want to use as the central point of their analysis
  • The size in metres of the radius of the circle around the central point within which other stores should be displayed

These M parameters could then be injected into the KQL query of a DirectQuery table using the following M query:

let
  Source = AzureDataExplorer.Contents(
      "https://mycluster.northeurope.kusto.windows.net", 
      "retailpoints", 
      Text.Format(
          "declare query_parameters(filterstore:string = ""#[storeMparametername]"", centreradius:int = #[radiusMparametername]);#(lf)let centrestore =  #(lf)retailpoints#(lf)| where store_name == filterstore#(lf)| project long_wgs, lat_wgs;#(lf)let centrelat = toscalar( centrestore | summarize max(lat_wgs));#(lf)let centrelong = toscalar ( centrestore | summarize max(long_wgs));#(lf)retailpoints#(lf)| where geo_point_in_circle(long_wgs, lat_wgs, centrelong, centrelat, centreradius)", 
          [
            storeMparametername  = selectedstore, 
            radiusMparametername = selectedradius
          ]
        ), 
      [
        MaxRows                 = null, 
        MaxSize                 = null, 
        NoTruncate              = null, 
        AdditionalSetStatements = null
      ]
    )
in
  Source

There are two important things to notice here:

  • I declared KQL query parameters to hold the selected store name and selected radius at the start of the KQL query. As the documentation notes, this is important to guard against KQL injection attacks.
  • I used the M Text.Format function to inject the M parameter values into the KQL query parameter declarations (illustrated below), which was easier and cleaner than generating the KQL query string by concatenating text.
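
To illustrate that second point, here’s a tiny standalone example of how Text.Format substitutes record fields into a string using #[fieldname] placeholders – the store name and radius here are just hard-coded sample values:

//Returns the text:
//declare query_parameters(filterstore:string = "M&S Amersham SF", centreradius:int = 5000);
Text.Format(
    "declare query_parameters(filterstore:string = ""#[storeMparametername]"", centreradius:int = #[radiusMparametername]);",
    [
        storeMparametername  = "M&S Amersham SF",
        radiusMparametername = "5000"
    ]
)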

And here’s the report I built, for example showing how you can select an Aldi store in Aberdeen and display all stores within a 1km, 3km and 7km radius of that store:

Not the most sophisticated geospatial report in the world, I admit, and I know very few people are using Azure Data Explorer as a data source for Power BI right now, but I do think it is a fun example of the kind of analysis that dynamic M parameters now make possible.

Why I’m Excited About Dynamic M Parameters In Power BI

My favourite feature in the October 2020 release of Power BI is undoubtedly dynamic M parameters. However, before I explain why, I’m going to put my hands up and admit that we screwed up the communication around its launch and didn’t explain its limitations properly – which led to some understandable disappointment and confusion.

Although the blog post and the docs have now been updated to make it clear that this feature doesn’t work with many popular data sources like SQL Server, Synapse, Oracle, Teradata and SAP (see here for the full details), I thought I’d explain why. It’s because there are two sets of connectors that Power BI uses for DirectQuery: an older set that was built some time ago for the data sources originally supported by Analysis Services Tabular, and a newer set built in Power Query’s M language. Dynamic M parameters only work for this newer set of connectors; if you want to see this feature extended so it works for the older connectors too then please vote on ideas.powerbi.com, talk to your Microsoft account team, call Satya or whatever you normally do.

OK, so why am I excited? Well, because dynamic M parameters let you do a whole bunch of cool things that weren’t possible before. My original plan for this blog post was to show how they can be used to change the dimension column used in a visual dynamically but Mim beat me to it with this blog post. Instead in this post I’m going to show you how you can use them to create something like dynamic calculated columns and I also want to make a more general point that will Blow Your Mind (hopefully).

For my example I’m going to use Azure Data Explorer (ADX) as my data source. There are a few reasons why: it’s something I’ve been interested to learn more about for a while, and the ADX team recently announced a bunch of new features which make me think it will be increasingly useful in BI scenarios; it will be important to know it if you’re using the new Azure Monitor integration for Power BI log data; and, crucially, it’s one of the few M-based connectors where you can write your own queries in DirectQuery mode (although this is coming to Snowflake and Redshift next year too).

My source data is from the UK Land Registry Price Paid dataset, and for my demo report I loaded a table containing around 790,000 real estate transactions into ADX. Each real estate transaction has the price paid for the property:

One obvious way to analyse this data is to build a histogram and show the number of transactions that occur in price bins such as £0-£100000, £100000-£200000, £200000-£300000 and so on. This is quite easy to do in Power BI, for example using calculated columns in Import mode, unless you want to allow your users to change the bin sizes themselves. Dynamic M parameters give you an easy way to solve this problem though.

KQL, the query language for ADX, has a really nice way of generating bins for histograms using the summarize() operator and bin() function. For example, the following KQL query does something like a SQL group by and shows the number of real estate transactions for each county and bins based on price in £25000 increments:

pricepaid 
| summarize Transactions=count() 
by county, bin(price, 25000)

In Power BI I created a parameter called binsizes:

…and then created a parameterised version of the KQL query above to populate a DirectQuery table. Here’s the M code for this table:

let
    Source = 
     AzureDataExplorer.Contents(
      "https://mycluster.northeurope.kusto.windows.net", 
      "pricepaid", 
      "pricepaid | summarize Transactions=count() by county, bin(price, " 
      & Number.ToText(binsizes) & 
      ")", 
      [MaxRows=null, MaxSize=null, NoTruncate=null, AdditionalSetStatements=null]
     )
in
    Source

Next I created a new table with the possible bin sizes that the user can choose using a DAX calculated table:

And then finally linked the column in this new table with the M parameter:

Here’s the final report, where you can see the number of transactions in each price bin and use slicers to change the price bins and change the county you’re looking at:

Apart from being a new solution to an old problem, what does this mean? Well, with a data source like ADX where M parameters are supported and you can write your own queries in DirectQuery mode, you can do whatever you want in the query so long as it always returns the same columns. As well as creating parameterised queries, in ADX you could also create a user-defined function (something like table-valued functions in SQL Server), pass parameters into it using M parameters, and have a totally dynamically generated fact table in your dataset. Either way, using dynamic M parameters in this way allows you to use KQL’s specialised functions for basket analysis or geospatial analysis to do things that are otherwise impossible in Power BI. I’m sure similar things are possible with other data sources too. And that, my friends, is why I’m so excited about dynamic M parameters.
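
For example – and this is purely a hypothetical sketch, not something from the report above – if you had created a KQL user-defined function called PriceBins in the pricepaid database that took a bin size as its parameter, you could call it from the DirectQuery M query and pass the binsizes M parameter into it like this:

//Hypothetical: assumes a KQL function PriceBins(binsize:long) exists in the database
let
    Source = AzureDataExplorer.Contents(
        "https://mycluster.northeurope.kusto.windows.net",
        "pricepaid",
        "PriceBins(" & Number.ToText(binsizes) & ")",
        [MaxRows = null, MaxSize = null, NoTruncate = null, AdditionalSetStatements = null]
    )
in
    Source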

A Function To Visualise Parallelism In Power Query Diagnostics Data

Most of the time I’ve spent looking at Power Query Query Diagnostics data, I’ve been looking at data for a single query. Over the past few days though I’ve spent some time investigating what this data shows for all the queries that are executed for a single dataset refresh. To help me do this I wrote the following M function:

(InputTable as table) => 
  let
    Source = InputTable,
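    //find the earliest start time of any event in the trace, then express all start/end times relative to it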
    EarliestStart = List.Min(Source[Start Time]),
    AddRelativeStart = Table.AddColumn(
        Source, 
        "Relative Start", 
        each [Start Time] - EarliestStart
      ),
    AddRelativeEnd = Table.AddColumn(
        AddRelativeStart, 
        "Relative End", 
        each [End Time] - EarliestStart
      ),
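    //convert the relative start/end values to durations, then to a total number of seconds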
    SetToDurations = Table.TransformColumnTypes(
        AddRelativeEnd, 
        {
          {"Relative Start", type duration}, 
          {"Relative End", type duration}
        }
      ),
    CalculatedTotalSeconds
      = Table.TransformColumns(
          SetToDurations, 
          {
            {
              "Relative Start", 
              Duration.TotalSeconds, 
              type number
            }, 
            {
              "Relative End", 
              Duration.TotalSeconds, 
              type number
            }
          }
        ),
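    //group by Id and Query to get one row per query execution, with its earliest start and latest end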
    GroupedRows = Table.Group(
        CalculatedTotalSeconds, 
        {"Id", "Query"}, 
        {
          {
            "Relative Start", 
            each List.Min([Relative Start]), 
            type number
          }, 
          {
            "Relative End", 
            each List.Max([Relative End]), 
            type number
          }
        }
      ),
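    //create a label combining Id and Query to use on the chart axis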
    MergeColumns = Table.AddColumn(
        GroupedRows, 
        "Id Query", 
        each Text.Combine({[Id], [Query]}, " "), 
        type text
      ),
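    //make Id numeric so it can be used as the Sort By column in Power BI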
    IdToNumber = Table.TransformColumnTypes(
        MergeColumns, 
        {{"Id", type number}}
      )
  in
    IdToNumber

If you invoke this function and pass in the “Detailed” query diagnostics data query:

…you’ll end up with a query that gives you a table that looks something like this:

This table has one row for each Power Query query that got executed while the diagnostics trace was running, an Id column to uniquely identify each execution, the name of the query executed, an Id Query column that concatenates the previous two columns, and Relative Start and Relative End columns that give you the number of seconds from the start time of the first query executed up to the start time and end time of the query on the current row.

Loading the data into Power BI allows you to build a stacked bar chart with Id Query on the axis and Relative Start and Relative End together in the values:

If you then:

  • set the Sort By Column property of Id Query to the Id column (which is numeric, so this ensures that the values in Id Query are sorted correctly)
  • set the Data Color property of the Relative Start values to white (or whatever the background of your chart is) so that it’s invisible
  • set the axis of your stacked bar chart to be sorted by Id Query in ascending order

…you get a visual something like this:

This is a kind of Gantt chart where the x axis shows the number of seconds since the start of the first query execution and each bar shows the start and end times of each query, making it easy to see which queries are executing in parallel.

“What can this show us?” I hear you say. Well, that’s something for a future blog post. All that I can say right now is that it’s COMPLICATED and I don’t properly understand it myself yet. But it’s undoubtedly interesting.

Optimising The Performance Of Power Query Merges In Power BI, Part 5: Cross Joins

In a late addition to the series of posts that I started here a few months ago, I thought it would be useful to look at how to optimise the performance of a rare (but nonetheless useful) type of merge: the cross join. Cross joins are not available as a join kind in the Table.Join or Table.NestedJoin M functions but there are two ways of creating them in Power Query:

  1. Creating a custom column in your query that contains the query that you want to cross join with and then expanding, as Matt Allington describes here
  2. Creating a custom column that contains a single value, such as the number 1, on both your queries and then using that column to merge the two queries on (something I mentioned a long time ago here)

Which one gives you the best performance? Apart from my old blog post there’s at least one other post suggesting the second approach is much faster, but I thought it would be useful to do some proper testing.

For my source data I created a query called SourceQuery that returned a table with one thousand rows, containing all the integers from 1 to 1000:
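
The M code for SourceQuery isn’t shown here, but a sketch along these lines (my assumption, not the exact original) would produce an equivalent single-column table:

//A sketch of a SourceQuery-style query: 1000 rows containing the integers 1 to 1000
let
    Source = {1..1000},
    #"Converted to Table" = Table.FromList(
        Source,
        Splitter.SplitByNothing(),
        {"Column1"},
        null,
        ExtraValues.Error
    ),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Converted to Table",
        {{"Column1", Int64.Type}}
    )
in
    #"Changed Type"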

To test the first approach, I created a second query that referenced SourceQuery, then added a custom column that also returned the SourceQuery query:

…and then expanded this new custom column. The result was a query that returned a million rows – the cross join of the thousand-row SourceQuery table with itself.

Here’s the M code:

let
  Source = SourceQuery,
  #"Added Custom" = Table.AddColumn(
      Source, 
      "Custom", 
      each SourceQuery
    ),
  #"Expanded Custom" = Table.ExpandTableColumn(
      #"Added Custom", 
      "Custom", 
      {"Column1"}, 
      {"Custom.Column1"}
    ),
  #"Changed Type" = Table.TransformColumnTypes(
      #"Expanded Custom", 
      {{"Custom.Column1", Int64.Type}}
    )
in
  #"Changed Type"

The two timings from Profiler (the same ones that I have used to measure refresh performance throughout this series) were:

  • Progress Report End/25 Execute SQL – 0.03 seconds
  • Progress Report End/17 Read Data – 3.80 seconds

To test the second approach I added a custom column to the SourceQuery query that contained the value 1:
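
Again, the exact M isn’t shown above, but the #"SourceQuery With Join Column" query referenced in the next step would be a sketch along these lines:

//A sketch of the #"SourceQuery With Join Column" query referenced below
let
    Source = SourceQuery,
    #"Added Custom" = Table.AddColumn(Source, "JoinColumn", each 1)
in
    #"Added Custom"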

…and then created a new query that joined this query to itself:

…and then did an expand to achieve the same cross joined output as in the first test. Here’s the M code (all created in the Power Query Editor – no custom M code required):

let
  Source = Table.NestedJoin(
      #"SourceQuery With Join Column", 
      {"JoinColumn"}, 
      #"SourceQuery With Join Column", 
      {"JoinColumn"}, 
      "SourceQuery With Join Column", 
      JoinKind.Inner
    ),
  #"Expanded SourceQuery With Join Column"
    = Table.ExpandTableColumn(
        Source, 
        "SourceQuery With Join Column", 
        {"Column1"}, 
        {"SourceQuery With Join Column.Column1"}
      ),
  #"Removed Columns" = Table.RemoveColumns(
      #"Expanded SourceQuery With Join Column", 
      {"JoinColumn"}
    )
in
  #"Removed Columns"

The performance of this query was much faster:

  • Progress Report End/25 Execute SQL – 0.03 seconds
  • Progress Report End/17 Read Data – 0.80 seconds

So, as you can see, the second approach is the one to use.

There’s another advantage of this second approach too, if you’re using a foldable data source like SQL Server: it is possible to make query folding happen, which is of course incredibly important from a performance point of view, although you have to be careful not to change the data type of your custom column (or at least be careful how you do it). For example, here’s the M for a query that gets the cross join of the DimScenario table from the Adventure Works DW database with itself:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source{[Name
    = "AdventureWorksDW2017"]}[Data],
  dbo_DimScenario = AdventureWorksDW2017{[
    Schema = "dbo", 
    Item = "DimScenario"
  ]}[Data],
  #"Removed Columns" = Table.RemoveColumns(
      dbo_DimScenario, 
      {"FactFinance"}
    ),
  #"Added Custom" = Table.AddColumn(
      #"Removed Columns", 
      "Custom", 
      each 1
    ),
  #"Merged Queries" = Table.NestedJoin(
      #"Added Custom", 
      {"Custom"}, 
      #"Added Custom", 
      {"Custom"}, 
      "Added Custom", 
      JoinKind.Inner
    ),
  #"Expanded Added Custom"
    = Table.ExpandTableColumn(
        #"Merged Queries", 
        "Added Custom", 
        {"ScenarioKey", "ScenarioName", "Custom"}, 
        {
          "Added Custom.ScenarioKey", 
          "Added Custom.ScenarioName", 
          "Added Custom.Custom"
        }
      ),
  #"Removed Other Columns" = Table.SelectColumns(
      #"Expanded Added Custom", 
      {
        "ScenarioName", 
        "Added Custom.ScenarioName"
      }
    )
in
  #"Removed Other Columns"

Here’s the SQL code generated by this query:

select [$Outer].[ScenarioName] as [ScenarioName],
    [$Inner].[ScenarioName2] as [Added Custom.ScenarioName]
from 
(
    select [_].[ScenarioKey] as [ScenarioKey],
        [_].[ScenarioName] as [ScenarioName],
        1 as [Custom]
    from [dbo].[DimScenario] as [_]
) as [$Outer]
inner join 
(
    select [_].[ScenarioKey] as [ScenarioKey2],
        [_].[ScenarioName] as [ScenarioName2],
        1 as [Custom2]
    from [dbo].[DimScenario] as [_]
) as [$Inner] on ([$Outer].[Custom] = [$Inner].[Custom2])

Displaying Routes On A Map In Power BI

In last week’s post I described the new Power Query M functions for working with Well Known Text objects, and in a post a few weeks ago I showed how you can use the Icon Map custom visual to display Well Known Text data. In this post I’ll show you how you can put all this together to do something really useful: display routes on a map in a Power BI report.

First of all you’ll need to download the latest version of the Icon Map custom visual here, because at the time of writing the version in AppSource doesn’t have the WKT functionality needed. Second, you’ll need a data source with a table of latitudes and longitudes that represent points on a route that you want to plot. I’m going to use the Azure Maps Get Route Directions API as my data source; to keep things simple I’m going to use the Shared Key authentication method for this API, but I’ll explain below why this can be a problem in the real world. To follow the example you’ll need to create an Azure Maps Account in the Azure Portal so you can get your own key.

Here’s a simple example to start off with. The following M code:

let
  Source = Json.Document(
      Web.Contents(
          "https://atlas.microsoft.com/route/directions/json", 
          [ApiKeyName = "subscription-key", Query
            = [
            #"api-version" = "1.0", 
            query
              = "51.502243,-0.140073:51.517777,-0.138621", 
            travelMode = "pedestrian"
          ]]
        )
    ),
  routes = Source[routes],
  routes1 = routes{0},
  legs = routes1[legs],
  legs1 = legs{0},
  points = legs1[points],
  PointsList = List.Transform(
      points, 
      each GeographyPoint.From(
          [longitude], 
          [latitude]
        )
    ),
  LineRecord = [Kind = "LINESTRING", Points
    = PointsList],
  WKTLineString = Geography.ToWellKnownText(
      LineRecord
    ),
  ToTable = #table(
      type table[
        Category = number, 
        WKT = text, 
        Size = number
      ], 
      {{1, WKTLineString, 1}}
    )
in
  ToTable

…calls the Azure Maps Get Route Directions API to get a walking route between two locations in central London. When you run the query you’ll be prompted to authenticate – choose the Web API option and enter your key. It’s a very basic example of how the API can be used: I have assumed there’s only one leg in the route, then taken the list of latitudes and longitudes for that single leg, turned that into a list of points and then turned that list into a linestring as I showed in last week’s blog post. Here’s the output in the Icon Map visual:

This is what the table returned by the query looks like:

The Category and Size columns aren’t very interesting; Icon Map needs these columns to display data. The WKT column contains the linestring object. Here’s how the visual is set up in Power BI:

We can do better than this though. In the example pbix file I’ve uploaded here, I’ve created a more sophisticated query that asks for six different routes between the two locations used above that all go via one waypoint (so there are two legs in the route data returned). I won’t go through the M code because there’s nothing interesting in it, but the end result is a report that allows you to plot these six different routes in different colours on a map and use a slicer to hide and show them, along with some information on their length and travel time:

One last thing to point out regarding these examples is that they won’t refresh if you publish them to the Power BI Service because I’ve used the ApiKeyName option in Web.Contents to handle key authentication (see here) and that isn’t supported in the Service yet. To work around this you’ll need to hard code the Azure Maps API key in your query and use the Skip Test Connection option in the Power BI Service.

Adding Tooltips To Column Headers In The Power Query Editor In Power BI And Excel

Here’s something that will excite all you M nerds out there! Did you know you can make tooltips appear when you hover over column headers in a table in the Power Query Editor in either Power BI Desktop or Excel?

It’s all done with M metadata, and here’s an example query:


let
    source = #table(
        {"firstname", "lastname"},
        {{"matt", "masson"}}
    ),
    tableType =
        type table[firstname = Text.Type, lastname = Text.Type]
        meta [
            Documentation.FieldDescription =
                [firstname = "Given Name", lastname = "Family Name"]
        ],
    replaceType = Value.ReplaceType(source, tableType)
in
    replaceType

Here’s what you see when your mouse hovers over the firstname column in the Power Query Editor:


…and here’s what you see when your mouse hovers over the lastname column:


How does this work? Here’s what each of the steps do:

  • The source step creates a simple table with two columns called firstname and lastname using #table (see here for more details on that).
  • The tableType step declares a new table type with two columns (the same two columns in the table from the previous step) and then adds a metadata record to this type. In that record the Documentation.FieldDescription field contains the text values that will appear as tooltips when you hover over each column.
  • The replaceType step replaces the type of the table returned by source with the type declared in tableType.

The Power Query Editor UI then looks for a Documentation.FieldDescription field in any metadata associated with a table and displays the values in that field when you hover over the appropriate column.
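
If you want to check that the metadata really is attached to the table’s type, you can inspect it yourself with Value.Metadata – here’s a quick sketch, assuming the query above has been named TooltipExample (my name, not one from the example):

//Assumption: the query above is called TooltipExample
//Returns the metadata record on the table type, including Documentation.FieldDescription
Value.Metadata(Value.Type(TooltipExample))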

Note that if you add any steps to your query after this that change the table type (for example that add or remove columns), the metadata is removed ☹. That said I still feel like this might be a useful feature for anyone building a custom connector, for example.

[Thanks to Matt Masson for telling me how all this works]

 

 

Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that it has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are but #shared shows them:

[Screenshot: the JoinAlgorithm values listed by #shared]

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:


let
    Source = Csv.Document(
        File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
        [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
    #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
    #"Changed Type" = Table.TransformColumnTypes(
        #"Promoted Headers",
        {
            {"A", Int64.Type},
            {"B", Int64.Type},
            {"C", Int64.Type},
            {"D", Int64.Type},
            {"E", Int64.Type},
            {"F", Int64.Type},
            {"G", Int64.Type}
        }
    ),
    #"Renamed Columns" = Table.RenameColumns(
        #"Changed Type",
        {{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
    )
in
    #"Renamed Columns"


Here’s a query that merges these queries using Table.NestedJoin, returns all columns from the source queries and, unlike them, is loaded into the dataset:


let
    Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
    #"Expanded Second" = Table.ExpandTableColumn(
        Source,
        "Second",
        {"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
        {"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
    )
in
    #"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:


let
    Source = Table.Join(
        First, {"A1"},
        Second, {"A2"},
        JoinKind.Inner,
        JoinAlgorithm.SortMerge
    )
in
    Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.
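
For what it’s worth, here’s a hedged sketch of what sorting before the merge might look like using the same First and Second queries – only worth doing if those sorts are cheap, for example because they fold back to the data source:

//A sketch only: sort both inputs on the join columns, then use SortMerge
let
    SortedFirst  = Table.Sort(First, {{"A1", Order.Ascending}}),
    SortedSecond = Table.Sort(Second, {{"A2", Order.Ascending}}),
    Merged = Table.Join(
        SortedFirst, {"A1"},
        SortedSecond, {"A2"},
        JoinKind.Inner,
        JoinAlgorithm.SortMerge
    )
in
    Merged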

 

 

Speed Up Data Refresh Performance In Power BI Desktop Using Table.View

It can sometimes be frustrating to work with slow data sources or complex Power Query queries in Power BI Desktop: you open the Power Query Editor, make some changes, click Close & Apply and then wait a loooong time for your data to refresh. In this post I’m going to show you a technique that can cut this wait by up to 50%. It involves some fairly complex M code but I promise you, the effort is worth it!

In fact, what I’m going to describe is more or less what I showed towards the end of my appearance on Guy In A Cube last year and in a few other posts, but at that time I didn’t understand properly why it worked or what the performance implications actually were. Now, thanks to a lot of help from Curt Hagenlocher of the Power Query development team I have all the details I need to blog about it.

Let’s see a simple example. Say you have a large JSON file – for this test I generated one that is 67MB containing random data using this handy online tool – and you want to load it into Power BI. You’ll end up with a query that looks something like this:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  )
in
  #"Changed Type"

Here’s what the output of this query looks like:


On my laptop this query takes around 7-8 seconds to load. Using Process Monitor to see how much data is being read from the file (see this post for more details) shows that Power Query is reading the data from this file twice. This graph, generated using data from Process Monitor, has time on the x axis and amount of data read from the file on the y axis, and the two peaks indicate that the data is being read twice:


Why? When you refresh a table in Power BI Desktop two things happen:

  • First, Power BI has to check what columns are present in the table. To do this, it runs the query for the table but asks the Power Query engine to filter the table so it returns zero rows.
  • Secondly the query is run again but this time with no filter, so all rows are returned, and this is when the data is actually loaded into the table in Power BI.

It’s on this first run of the query where problems can occur. If you’re using a data source like SQL Server and you’re not doing any complex transformations then query folding will take place, so Power BI’s request to return the table but with no rows can be handled very efficiently. However, if you’re using a data source where query folding is not possible (such as Excel, CSV or JSON files) or you have complex transformations that stop folding taking place then the only way Power Query can work out what columns the query returns is by running the entire query. That’s what is happening in the example above.

This only happens in Power BI Desktop, though. When you refresh a dataset in the Power BI Service the query is only run once.

The solution is to trick the Power Query engine into running these zero-row filter queries instantly, and you can do that using the Table.View M function. I blogged about Table.View here (and I strongly suggest you read that post before carrying on) but that example didn’t quite get to the solution you need here. Here’s a new version of the query above with an extra step:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  ),
  OverrideZeroRowFilter = Table.View(
    null, 
    [
    GetType = () => 
      type table[
      _id = Text.Type, 
      index = Int64.Type, 
      guid = Text.Type, 
      isActive = Text.Type, 
      balance = Text.Type
    ], 
    GetRows = () => 
      #"Changed Type", 
    OnTake = (count as number) => 
      if count = 0 then 
      #table(
      type table[
        _id = Text.Type, 
        index = Int64.Type, 
        guid = Text.Type, 
        isActive = Text.Type, 
        balance = Text.Type
      ], 
      {}
    ) 
    else 
     Table.FirstN(#"Changed Type", count)]
  )
in
  OverrideZeroRowFilter

The OverrideZeroRowFilter step is where the magic happens. It uses Table.View to override query folding behaviour by intercepting what happens when the table returned by the query is filtered. There are three fields in the record in the second parameter of Table.View that you need to change:

  • GetType returns a table type that describes the columns and their data types present in the output of the query. The five columns listed here are the five columns you can see in the screenshot of the query output above. It’s very easy to generate the required M code when you use the custom function that I blogged about here.
  • GetRows returns all the rows that the query can return, that’s to say the table returned by the #”Changed Type” step.
  • OnTake is used when a top n filter is applied to the table returned by the query.  In this case the code looks at the number of rows that are being requested (specified in the count parameter) and if that’s zero, it just returns an empty table with the same columns as the #”Changed Type” step; if it’s more than zero then it uses Table.FirstN to get the actual number of rows requested from #”Changed Type”. This means that when Power BI does that zero-row filter it can now happen immediately because there’s no need to go back to the data source or execute any of the transformations in the query.

This version of the query now runs in 4-5 seconds, and Process Monitor now shows that the JSON file is only read once and, obviously, reading the data once is a lot faster than reading it twice:


Look at how the graph flattens at the end… but that’s something for another blog post.

There is a downside to this approach: you have to hard-code the table schema that you expect your query to return, and if you change your query to return different columns you’ll have to update the table type in the last step.
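
One small tip (my suggestion, not part of the original technique): if you’re not sure exactly what to hard-code, the Table.Schema function lists the name and type of every column in a table, so adding a temporary step like the one below after #"Changed Type" shows you everything you need to write out the table type by hand:

//A temporary helper step: lists each column's Name and TypeName for the
//#"Changed Type" table so you can copy them into the GetType and OnTake types
ColumnSchema = Table.Schema(#"Changed Type")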

I’ve used this technique on other slow queries and it has made a massive difference to the development experience in Power BI Desktop – one query that was taking five minutes to refresh when I closed the Power Query Editor went down to two and a half minutes. If you try this out yourself please let me know if it makes a difference by leaving a comment.

[Don’t forget that there are other things you can do that will also speed up the development experience for slow queries: you should definitely turn off the “Allow data preview to download in the background” option, and you might want to consider turning off data privacy checks so long as you fully understand what the implications are]

 
