A Function To Visualise Parallelism In Power Query Diagnostics Data

Most of the time I’ve spent looking at Power Query Query Diagnostics data has been spent on data for a single query. Over the past few days, though, I’ve spent some time investigating what this data shows for all the queries that are executed for a single dataset refresh. To help me do this I wrote the following M function:

(InputTable as table) => 
  let
    Source = InputTable,
    EarliestStart = List.Min(Source[Start Time]),
    AddRelativeStart = Table.AddColumn(
        Source, 
        "Relative Start", 
        each [Start Time] - EarliestStart
      ),
    AddRelativeEnd = Table.AddColumn(
        AddRelativeStart, 
        "Relative End", 
        each [End Time] - EarliestStart
      ),
    SetToDurations = Table.TransformColumnTypes(
        AddRelativeEnd, 
        {
          {"Relative Start", type duration}, 
          {"Relative End", type duration}
        }
      ),
    CalculatedTotalSeconds
      = Table.TransformColumns(
          SetToDurations, 
          {
            {
              "Relative Start", 
              Duration.TotalSeconds, 
              type number
            }, 
            {
              "Relative End", 
              Duration.TotalSeconds, 
              type number
            }
          }
        ),
    GroupedRows = Table.Group(
        CalculatedTotalSeconds, 
        {"Id", "Query"}, 
        {
          {
            "Relative Start", 
            each List.Min([Relative Start]), 
            type number
          }, 
          {
            "Relative End", 
            each List.Max([Relative End]), 
            type number
          }
        }
      ),
    MergeColumns = Table.AddColumn(
        GroupedRows, 
        "Id Query", 
        each Text.Combine({[Id], [Query]}, " "), 
        type text
      ),
    IdToNumber = Table.TransformColumnTypes(
        MergeColumns, 
        {{"Id", type number}}
      )
  in
    IdToNumber

If you invoke this function and pass in the “Detailed” query diagnostics data query:
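The screenshot isn’t reproduced here, but the invocation itself is just a single step. Here’s a sketch, assuming the function has been saved as a query called VisualiseParallelism and the detailed diagnostics output is in a query called Diagnostics_Detailed (both names are assumptions, not names from the original):

let
  // Pass the detailed diagnostics query to the function defined above
  Source = VisualiseParallelism(Diagnostics_Detailed)
in
  Source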

…you’ll end up with a query that gives you a table that looks something like this:

This table has one row for each Power Query query that got executed while the diagnostics trace was running, an Id column to uniquely identify each execution, the name of the query executed, an Id Query column that concatenates the previous two columns, and Relative Start and Relative End columns that give you the number of seconds from the start time of the first query executed up to the start time and end time of the query on the current row.

Loading the data into Power BI allows you to build a stacked bar chart with Id Query on the axis and Relative Start and Relative End together in the values:

If you then set the Sort By Column property of Id Query to the Id column (which is numeric, so this ensures that the values in Id Query are sorted correctly), set the Data Color property of the Relative Start values to white (or whatever the background of your chart is) so that it’s invisible, and finally sort the axis of your stacked bar chart by Id Query in ascending order, you get a visual something like this:

This is a kind of Gantt chart where the x axis shows the number of seconds since the start of the first query execution and each bar shows the start and end times of each query, making it easy to see which queries are executing in parallel.

“What can this show us?” I hear you say. Well, that’s something for a future blog post. All that I can say right now is that it’s COMPLICATED and I don’t properly understand it myself yet. But it’s undoubtedly interesting.

Optimising The Performance Of Power Query Merges In Power BI, Part 5: Cross Joins

In a late addition to the series of posts that I started here a few months ago, I thought it would be useful to look at how to optimise the performance of a rare (but nonetheless useful) type of merge: the cross join. Cross joins are not available as a join kind in the Table.Join or Table.NestedJoin M functions but there are two ways of creating them in Power Query:

  1. Creating a custom column in your query that contains the query that you want to cross join with and then expanding, as Matt Allington describes here
  2. Adding a custom column that contains a single value, such as the number 1, to both of your queries and then merging the two queries on that column (something I mentioned a long time ago here)

Which one gives you the best performance? Apart from my old blog post there’s at least one other post suggesting the second approach is much faster, but I thought it would be useful to do some proper testing.

For my source data I created a query called SourceQuery that returned a table with one thousand rows, containing all the integers from 1 to 1000:
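The original shows this as a screenshot rather than code, but a query along these lines (a sketch, not necessarily the exact code used) produces the same thousand-row table:

let
  // Build a single-column table containing the integers 1 to 1000
  Source = Table.FromList({1..1000}, Splitter.SplitByNothing(), {"Column1"}),
  ChangedType = Table.TransformColumnTypes(Source, {{"Column1", Int64.Type}})
in
  ChangedType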

To test the first approach, I created a second query that referenced SourceQuery, then added a custom column that also returned the SourceQuery query:

…and then expanded this new custom column. The result was a query that returned a million rows – the cross join of the thousand-row SourceQuery table with itself.

Here’s the M code:

let
  Source = SourceQuery,
  #"Added Custom" = Table.AddColumn(
      Source, 
      "Custom", 
      each SourceQuery
    ),
  #"Expanded Custom" = Table.ExpandTableColumn(
      #"Added Custom", 
      "Custom", 
      {"Column1"}, 
      {"Custom.Column1"}
    ),
  #"Changed Type" = Table.TransformColumnTypes(
      #"Expanded Custom", 
      {{"Custom.Column1", Int64.Type}}
    )
in
  #"Changed Type"

The two timings from Profiler (the same ones that I have used to measure refresh performance throughout this series) were:

  • Progress Report End/25 Execute SQL – 0.03 seconds
  • Progress Report End/17 Read Data – 3.80 seconds

To test the second approach I added a custom column to the SourceQuery query that contained the value 1:
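Again this is shown as a screenshot in the original, but the query referred to as SourceQuery With Join Column in the merge code below might look something like this (a sketch):

let
  Source = SourceQuery,
  // Add a constant-valued column to join on; the merge below uses the name JoinColumn
  AddedJoinColumn = Table.AddColumn(Source, "JoinColumn", each 1, Int64.Type)
in
  AddedJoinColumn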

…and then created a new query that joined this query to itself:

…and then did an expand to achieve the same cross joined output as in the first test. Here’s the M code (all created in the Power Query Editor – no custom M code required):

let
  Source = Table.NestedJoin(
      #"SourceQuery With Join Column", 
      {"JoinColumn"}, 
      #"SourceQuery With Join Column", 
      {"JoinColumn"}, 
      "SourceQuery With Join Column", 
      JoinKind.Inner
    ),
  #"Expanded SourceQuery With Join Column"
    = Table.ExpandTableColumn(
        Source, 
        "SourceQuery With Join Column", 
        {"Column1"}, 
        {"SourceQuery With Join Column.Column1"}
      ),
  #"Removed Columns" = Table.RemoveColumns(
      #"Expanded SourceQuery With Join Column", 
      {"JoinColumn"}
    )
in
  #"Removed Columns"

The performance of this query was much faster:

  • Progress Report End/25 Execute SQL – 0.03 seconds
  • Progress Report End/17 Read Data – 0.80 seconds

So, as you can see, the second approach is the one to use.

There’s another advantage of this second approach too, if you’re using a foldable data source like SQL Server: it is possible to make query folding happen, which is of course incredibly important from a performance point of view, although you have to be careful not to change the data type of your custom column (or at least be careful how you do it). For example, here’s the M for a query that gets the cross join of the DimScenario table from the Adventure Works DW database with itself:

let
  Source = Sql.Databases("localhost"),
  AdventureWorksDW2017 = Source{[Name
    = "AdventureWorksDW2017"]}[Data],
  dbo_DimScenario = AdventureWorksDW2017{[
    Schema = "dbo", 
    Item = "DimScenario"
  ]}[Data],
  #"Removed Columns" = Table.RemoveColumns(
      dbo_DimScenario, 
      {"FactFinance"}
    ),
  #"Added Custom" = Table.AddColumn(
      #"Removed Columns", 
      "Custom", 
      each 1
    ),
  #"Merged Queries" = Table.NestedJoin(
      #"Added Custom", 
      {"Custom"}, 
      #"Added Custom", 
      {"Custom"}, 
      "Added Custom", 
      JoinKind.Inner
    ),
  #"Expanded Added Custom"
    = Table.ExpandTableColumn(
        #"Merged Queries", 
        "Added Custom", 
        {"ScenarioKey", "ScenarioName", "Custom"}, 
        {
          "Added Custom.ScenarioKey", 
          "Added Custom.ScenarioName", 
          "Added Custom.Custom"
        }
      ),
  #"Removed Other Columns" = Table.SelectColumns(
      #"Expanded Added Custom", 
      {
        "ScenarioName", 
        "Added Custom.ScenarioName"
      }
    )
in
  #"Removed Other Columns"

Here’s the SQL code generated by this query:

select [$Outer].[ScenarioName] as [ScenarioName],
    [$Inner].[ScenarioName2] as [Added Custom.ScenarioName]
from 
(
    select [_].[ScenarioKey] as [ScenarioKey],
        [_].[ScenarioName] as [ScenarioName],
        1 as [Custom]
    from [dbo].[DimScenario] as [_]
) as [$Outer]
inner join 
(
    select [_].[ScenarioKey] as [ScenarioKey2],
        [_].[ScenarioName] as [ScenarioName2],
        1 as [Custom2]
    from [dbo].[DimScenario] as [_]
) as [$Inner] on ([$Outer].[Custom] = [$Inner].[Custom2])

Displaying Routes On A Map In Power BI

In last week’s post I described the new Power Query M functions for working with Well Known Text objects, and in a post a few weeks ago I showed how you can use the Icon Map custom visual to display Well Known Text data. In this post I’ll show you how you can put all this together to do something really useful: display routes on a map in a Power BI report.

First of all you’ll need to download the latest version of the Icon Map custom visual here, because at the time of writing the version in AppSource doesn’t have the WKT functionality needed. Second, you’ll need a data source with a table of latitudes and longitudes that represent points on a route that you want to plot. I’m going to use the Azure Maps Get Route Directions API as my data source; to keep things simple I’m going to use the Shared Key authentication method for this API, but I’ll explain below why this can be a problem in the real world. To follow the example you’ll need to create an Azure Maps Account in the Azure Portal so you can get your own key.

Here’s a simple example to start off with. The following M code:

let
  Source = Json.Document(
      Web.Contents(
          "https://atlas.microsoft.com/route/directions/json", 
          [ApiKeyName = "subscription-key", Query
            = [
            #"api-version" = "1.0", 
            query
              = "51.502243,-0.140073:51.517777,-0.138621", 
            travelMode = "pedestrian"
          ]]
        )
    ),
  routes = Source[routes],
  routes1 = routes{0},
  legs = routes1[legs],
  legs1 = legs{0},
  points = legs1[points],
  PointsList = List.Transform(
      points, 
      each GeographyPoint.From(
          [longitude], 
          [latitude]
        )
    ),
  LineRecord = [Kind = "LINESTRING", Points
    = PointsList],
  WKTLineString = Geography.ToWellKnownText(
      LineRecord
    ),
  ToTable = #table(
      type table[
        Category = number, 
        WKT = text, 
        Size = number
      ], 
      {{1, WKTLineString, 1}}
    )
in
  ToTable

…calls the Azure Maps Get Route Directions API to get a walking route between two locations in central London. When you run the query you’ll be prompted to authenticate – choose the Web API option and enter your key. It’s a very basic example of how the API can be used: I have assumed there’s only one leg in the route, then taken the list of latitudes and longitudes for that single leg, turned that into a list of points and then turned that list into a linestring as I showed in last week’s blog post. Here’s the output in the Icon Map visual:

This is what the table returned by the query looks like:

The Category and Size columns aren’t very interesting; Icon Map needs these columns to display data. The WKT column contains the linestring object. Here’s how the visual is set up in Power BI:

We can do better than this though. In the example pbix file I’ve uploaded here, I’ve created a more sophisticated query that asks for six different routes between the two locations used above that all go via one waypoint (so there are two legs in the route data returned). I won’t go through the M code because there’s nothing interesting in it, but the end result is a report that allows you to plot these six different routes in different colours on a map and use a slicer to hide and show them, along with some information on their length and travel time:

One last thing to point out regarding these examples is that they won’t refresh if you publish them to the Power BI Service because I’ve used the ApiKeyName option in Web.Contents to handle key authentication (see here) and that isn’t supported in the Service yet. To work around this you’ll need to hard code the Azure Maps API key in your query and use the Skip Test Connection option in the Power BI Service.
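For illustration, the Source step of the first query above could be rewritten along these lines. This is a sketch only, since hard-coding keys in your M code is something to be careful with; “<your Azure Maps key>” is a placeholder:

let
  // Hard-coded key passed as a regular query parameter instead of via ApiKeyName
  Source = Json.Document(
      Web.Contents(
          "https://atlas.microsoft.com/route/directions/json",
          [
            Query = [
              #"subscription-key" = "<your Azure Maps key>",
              #"api-version" = "1.0",
              query = "51.502243,-0.140073:51.517777,-0.138621",
              travelMode = "pedestrian"
            ]
          ]
        )
    )
in
  Source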

Adding Tooltips To Column Headers In The Power Query Editor In Power BI And Excel

Here’s something that will excite all you M nerds out there! Did you know you can make tooltips appear when you hover over column headers in a table in the Power Query Editor in either Power BI Desktop or Excel?

It’s all done with M metadata, and here’s an example query:


let
  source =
    #table({"firstname", "lastname"}, {{"matt", "masson"}}),
  tableType =
    type table [firstname = Text.Type, lastname = Text.Type]
    meta [
      Documentation.FieldDescription =
        [firstname = "Given Name", lastname = "Family Name"]
    ],
  replaceType = Value.ReplaceType(source, tableType)
in
  replaceType

Here’s what you see when your mouse hovers over the firstname column in the Power Query Editor:


…and here’s what you see when your mouse hovers over the lastname column:


How does this work? Here’s what each of the steps do:

  • The source step creates a simple table with two columns called firstname and lastname using #table (see here for more details on that).
  • The tabletype step declares a new table type with two columns (the same two columns in the table from the previous step) and then adds a metadata record to this type. In that record the Documentation.FieldDescription field contains the text values that will appear as tooltips when you hover over each column.
  • The replacetype step replaces the type of the table returned by source with the type declared in tabletype.

The Power Query Editor UI then looks for a Documentation.FieldDescription field in any metadata associated with a table and displays the values in that field when you hover over the appropriate column.
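If you want to see this metadata yourself, you can inspect it with Value.Type and Value.Metadata. Here’s a sketch that assumes the query above has been saved as a query called TooltipExample (an assumed name):

let
  // Get the ascribed table type, then read the metadata record attached to it;
  // the result should contain the Documentation.FieldDescription field
  TableType = Value.Type(TooltipExample),
  TypeMetadata = Value.Metadata(TableType)
in
  TypeMetadata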

Note that if you add any steps to your query after this that change the table type (for example, steps that add or remove columns), the metadata is removed ☹. That said, I still feel like this might be a useful feature for anyone building a custom connector, for example.

[Thanks to Matt Masson for telling me how all this works]

 

 

Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that it has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are but #shared shows them:

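The screenshot isn’t reproduced here, but you can get the same list yourself by querying #shared. Here’s a sketch:

let
  // #shared is a record containing every shared value in the environment,
  // including the members of the JoinAlgorithm enumeration
  SharedNames = Record.FieldNames(#shared),
  JoinAlgorithmOptions = List.Sort(
      List.Select(SharedNames, each Text.StartsWith(_, "JoinAlgorithm."))
    )
in
  JoinAlgorithmOptions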

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:


let
  Source = Csv.Document(
      File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
      [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
    ),
  #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed Type" = Table.TransformColumnTypes(
      #"Promoted Headers",
      {
        {"A", Int64.Type},
        {"B", Int64.Type},
        {"C", Int64.Type},
        {"D", Int64.Type},
        {"E", Int64.Type},
        {"F", Int64.Type},
        {"G", Int64.Type}
      }
    ),
  #"Renamed Columns" = Table.RenameColumns(
      #"Changed Type",
      {{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
    )
in
  #"Renamed Columns"


Here’s a query that merges these queries using Table.NestedJoin, returns all columns from the source queries and is loaded into the dataset:


let
  Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
  #"Expanded Second" = Table.ExpandTableColumn(
      Source,
      "Second",
      {"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
      {"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
    )
in
  #"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:


let
  Source = Table.Join(
      First, {"A1"}, 
      Second, {"A2"}, 
      JoinKind.Inner, 
      JoinAlgorithm.SortMerge
    )
in
  Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message – you’ll just get incorrect data returned – so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, you can sort it in Power Query before the merge – but since sorting itself takes time, and sorting for non-foldable data sources is another one of those operations that requires the table to be held in memory, you’re unlikely to get any performance improvement.
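For reference, sorting both inputs before the join would look something like this (a sketch using the First and Second queries from above):

let
  // Sort both inputs on the join column first; for non-foldable sources this sort
  // itself may cost more than the SortMerge join saves
  SortedFirst = Table.Sort(First, {{"A1", Order.Ascending}}),
  SortedSecond = Table.Sort(Second, {{"A2", Order.Ascending}}),
  Joined = Table.Join(
      SortedFirst, {"A1"}, 
      SortedSecond, {"A2"}, 
      JoinKind.Inner, 
      JoinAlgorithm.SortMerge
    )
in
  Joined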

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.

 

 

Speed Up Data Refresh Performance In Power BI Desktop Using Table.View

It can sometimes be frustrating to work with slow data sources or complex Power Query queries in Power BI Desktop: you open the Power Query Editor, make some changes, click Close & Apply and then wait a loooong time for your data to refresh. In this post I’m going to show you a technique that can cut this wait by up to 50%. It involves some fairly complex M code but I promise you, the effort is worth it!

In fact, what I’m going to describe is more or less what I showed towards the end of my appearance on Guy In A Cube last year and in a few other posts, but at that time I didn’t understand properly why it worked or what the performance implications actually were. Now, thanks to a lot of help from Curt Hagenlocher of the Power Query development team I have all the details I need to blog about it.

Let’s see a simple example. Say you have a large JSON file – for this test I generated one that is 67MB containing random data using this handy online tool – and you want to load it into Power BI. You’ll end up with a query that looks something like this:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  )
in
  #"Changed Type"

Here’s what the output of this query looks like:


On my laptop this query takes around 7-8 seconds to load. Using Process Monitor to see how much data is being read from the file (see this post for more details) shows that Power Query is reading the data from this file twice. This graph, generated using data from Process Monitor, has time on the x axis and amount of data read from the file on the y axis, and the two peaks indicate that the data is being read twice:


Why? When you refresh a table in Power BI Desktop two things happen:

  • First, Power BI has to check what columns are present in the table. To do this, it runs the query for the table but asks the Power Query engine to filter the table so it returns zero rows.
  • Secondly the query is run again but this time with no filter, so all rows are returned, and this is when the data is actually loaded into the table in Power BI.

It’s on this first run of the query where problems can occur. If you’re using a data source like SQL Server and you’re not doing any complex transformations then query folding will take place, so Power BI’s request to return the table but with no rows can be handled very efficiently. However, if you’re using a data source where query folding is not possible (such as Excel, CSV or JSON files) or you have complex transformations that stop folding taking place then the only way Power Query can work out what columns the query returns is by running the entire query. That’s what is happening in the example above.

This only happens in Power BI Desktop, though. When you refresh a dataset in the Power BI Service the query is only run once.

The solution is to trick the Power Query engine into running these zero-row filter queries instantly, and you can do that using the Table.View M function. I blogged about Table.View here (and I strongly suggest you read that post before carrying on) but that example didn’t quite get to the solution you need here. Here’s a new version of the query above with an extra step:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  ),
  OverrideZeroRowFilter = Table.View(
    null, 
    [
      GetType = () =>
        type table [
          _id = Text.Type, 
          index = Int64.Type, 
          guid = Text.Type, 
          isActive = Text.Type, 
          balance = Text.Type
        ],
      GetRows = () =>
        #"Changed Type",
      OnTake = (count as number) =>
        if count = 0 then
          #table(
            type table [
              _id = Text.Type, 
              index = Int64.Type, 
              guid = Text.Type, 
              isActive = Text.Type, 
              balance = Text.Type
            ],
            {}
          )
        else
          Table.FirstN(#"Changed Type", count)
    ]
  )
in
  OverrideZeroRowFilter

The OverrideZeroRowFilter step is where the magic happens. It uses Table.View to override query folding behaviour by intercepting what happens when the table returned by the query is filtered. There are three fields in the record in the second parameter of Table.View that you need to change:

  • GetType returns a table type that describes the columns and their data types present in the output of the query. The five columns listed here are the five columns returned by the #”Changed Type” step in the query above. It’s very easy to generate the required M code when you use the custom function that I blogged about here.
  • GetRows returns all the rows that the query can return, that’s to say the table returned by the #”Changed Type” step.
  • OnTake is used when a top n filter is applied to the table returned by the query.  In this case the code looks at the number of rows that are being requested (specified in the count parameter) and if that’s zero, it just returns an empty table with the same columns as the #”Changed Type” step; if it’s more than zero then it uses Table.FirstN to get the actual number of rows requested from #”Changed Type”. This means that when Power BI does that zero-row filter it can now happen immediately because there’s no need to go back to the data source or execute any of the transformations in the query.

This version of the query now runs in 4-5 seconds, and Process Monitor now shows that the JSON file is only read once and, obviously, reading the data once is a lot faster than reading it twice:


Look at how the graph flattens at the end… but that’s something for another blog post.

There is a downside to this approach: you have to hard-code the table schema that you expect your query to return, and if you change your query to return different columns you’ll have to update the table type in the last step.

I’ve used this technique on other slow queries and it has made a massive difference to the development experience in Power BI Desktop – one query that was taking five minutes to refresh when I closed the Power Query Editor went down to two and a half minutes. If you try this out yourself please let me know if it makes a difference by leaving a comment.

[Don’t forget that there are other things you can do that will also speed up the development experience for slow queries: you should definitely turn off the “Allow data preview to download in the background” option, and you might want to consider turning off data privacy checks so long as you fully understand what the implications are]

 

Power Query Formatter

If you’ve ever used DAX Formatter to format your DAX code and wondered why there isn’t an equivalent for Power Query/M, then wonder no more: the nice people at Power Pivot Insights in Germany (read their German-language blog here) have built one. You can find it here:

https://powerqueryformatter.com/


They also have an API you can use.

Bonus information: did you know that Microsoft also has an open source parser for M, available here?

Understanding The “The key didn’t match any rows in the table” Error In Power Query In Power BI Or Excel

One of the most common errors you’ll see when working with Power Query in Power BI or Excel is this:

Expression.Error: The key didn’t match any rows in the table


It can occur with almost any data source and in a wide variety of different circumstances, and for new users of Power Query it can be very confusing. In this post I’ll explain what the error message means and when you’re likely to encounter it using a simple example.

TL;DR You’re probably getting this error because your Power Query query is trying to connect to a table or worksheet or something else in your data source that has been deleted or renamed.

Say you have an Excel workbook with three worksheets called Sheet1, Sheet2 and Sheet3:


On Sheet1 there’s some data you want to load into Power BI or Excel using Power Query:


It’s very easy to do this: after you click on the Excel data source and select your Excel workbook, you see the Navigator window with the contents of the workbook and select Sheet1:


…click OK and your query is ready:


What actually happens in this query though? If you click on each of the four steps in the Applied Steps pane on the right-hand side of the screen, you’ll see what each step does. In particular, notice that the first step in the query (called Source) returns a table with one row for each worksheet, table and named range in the workbook:


This is something that happens with many types of data source: after you have connected you have to choose what data inside the data source you want to connect to in the first step, and the list of things you can connect to is returned in a table. This table lists the same things displayed in the Navigator dialog shown above. In the same way, if you connect to a SQL Server database this first step returns a table with all the tables and views in the database; if you connect to an OData feed the first step shows all the resources available through the feed.

The other thing to notice in the table in the screenshot is the Data column, which contains nested table values. If you were to click on one of these nested tables you would navigate to the data in the worksheet named on that row and that is in fact what the second step in the query does.

Here’s the M code for the whole query:


let
  Source = Excel.Workbook(
      File.Contents("C:\MyExcelFile.xlsx"),
      null,
      true
    ),
  Sheet1_Sheet = Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
  #"Promoted Headers" = Table.PromoteHeaders(
      Sheet1_Sheet,
      [PromoteAllScalars = true]
    ),
  #"Changed Type" = Table.TransformColumnTypes(
      #"Promoted Headers",
      {
        {"Month", type text},
        {"Sales", Int64.Type}
      }
    )
in
  #"Changed Type"

If you want to understand how the code in the second step of this query gets the contents of the cell containing the nested table, I have a blog post here that goes into a lot more detail. Basically the second step (called Sheet1_Sheet in the code, but shown as Navigation in the Applied Steps pane) returns the nested table in the Data column from the row in the table where the Item column contains the value “Sheet1” and the Kind column contains the value “Sheet”. It’s the contents of the Item and Kind columns that are used to identify the row in the table that contains the data you want to see – in database terminology these columns are the key columns on the table.

What happens if you change the name of the worksheet that contains the data you want, from “Sheet1” to “Hello”?


The result is that the first step of the query now returns a table that looks like this:


Since there isn’t a row where the Item column contains “Sheet1” any more, the second step can no longer find the row it’s looking for – the key value it’s looking for no longer exists in the table – which is why you see the error message you do:


There are two ways to fix the error:

  1. Go to the Excel workbook and change the name of the sheet called “Hello” back to “Sheet1” or
  2. In either the Advanced Editor or the Formula Bar edit the M code and replace the reference to “Sheet1” with “Hello” in the second step of the query:
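After making that change, the second step of the query looks like this:

  // The navigation step now looks for a sheet called "Hello" instead of "Sheet1"
  Sheet1_Sheet = Source{[Item = "Hello", Kind = "Sheet"]}[Data],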

In summary, when you’re working with data sources in Power Query you need to be aware that if something in your data source is renamed or deleted then your query will break. You can of course write some clever M code to deal with situations like this but my recommendation is to try to fix the problem in the data source and not in your query.
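As an illustration of what that clever M code might look like, here’s a sketch (not from the original post) that falls back to the first worksheet in the workbook if no sheet called “Sheet1” exists:

let
  Source = Excel.Workbook(File.Contents("C:\MyExcelFile.xlsx"), null, true),
  // Look for a worksheet called "Sheet1" in the navigation table
  MatchingRows = Table.SelectRows(Source, each [Item] = "Sheet1" and [Kind] = "Sheet"),
  // If it exists use it, otherwise fall back to the first worksheet in the workbook
  SheetData =
    if Table.RowCount(MatchingRows) > 0 then
      MatchingRows{0}[Data]
    else
      Table.SelectRows(Source, each [Kind] = "Sheet"){0}[Data]
in
  SheetData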

An M Function To Help You Explore Power Query Diagnostics Data

Last week’s post showed an M function that took Power Query diagnostics data and formatted it in a way that made it suitable for visualisation in a Power BI Decomposition Tree visual. This is great for understanding what’s going on at a high level, but by doing this you also lose a lot of detailed information from the diagnostics logs that could be useful for performance tuning. This week I have an M function for you that takes a different approach to solving the same problem, one that might be more appealing for the hardcore Power Query fans out there.

Here’s the code for the function:

(Input as table) as table =>
let
  Source = Input,
  #"Expanded Additional Info" = Table.ExpandRecordColumn(Source, "Additional Info", {"Message"}, {"Message"}),
  #"Calculated Total Seconds" = Table.TransformColumns(#"Expanded Additional Info", {{"Exclusive Duration", Duration.TotalSeconds, type number}}),
  #"Sorted Rows" = Table.Sort(#"Calculated Total Seconds", {{"Id", Order.Ascending}, {"Start Time", Order.Ascending}}),
  #"Removed Other Columns" = Table.SelectColumns(#"Sorted Rows", {"Id", "Query", "Category", "Operation", "Start Time", "End Time", "Exclusive Duration (%)", "Exclusive Duration", "Data Source Query", "Message", "Row Count", "Content Length", "Path", "Group Id"}),
  #"Changed Type" = Table.TransformColumnTypes(#"Removed Other Columns", {{"Message", type text}}),
  #"Replaced Value" = Table.ReplaceValue(#"Changed Type", null, "Missing", Replacer.ReplaceValue, {"Path"}),
  BufferedTable = Table.Buffer(#"Replaced Value"),
  GetAllChildRows = (CurrentId, CurrentPath) =>
    Table.SelectRows(BufferedTable, each [Path] <> "Missing" and [Id] = CurrentId and Text.StartsWith([Path], CurrentPath)),
  AddTotalED = Table.AddColumn(#"Replaced Value", "Exclusive Duration (Including Child Operations)", each List.Sum(GetAllChildRows([Id], [Path])[Exclusive Duration]), type number),
  AddTotalEDPct = Table.AddColumn(AddTotalED, "Exclusive Duration (%) (Including Child Operations)", each List.Sum(GetAllChildRows([Id], [Path])[#"Exclusive Duration (%)"]), Percentage.Type),
  #"Inserted Text Before Delimiter" = Table.AddColumn(AddTotalEDPct, "Parent Path", each Text.BeforeDelimiter([Path], "/", {0, RelativePosition.FromEnd}), type text),
  #"Added Custom" = Table.AddColumn(
      #"Inserted Text Before Delimiter",
      "Child Operations",
      each
        let
          CurrentPath = [Path],
          CurrentId = [Id],
          ChildRows = Table.SelectRows(
              @#"Added Custom",
              each [Path] <> "Missing" and [Parent Path] = CurrentPath and [Id] = CurrentId
            ),
          Output = if Table.RowCount(ChildRows) = 0 then null else ChildRows
        in
          Output
    ),
  #"Filtered Rows" = Table.SelectRows(#"Added Custom", each ([Path] = "0" or [Path] = "Missing")),
  #"Removed Columns" = Table.RemoveColumns(#"Filtered Rows", {"Parent Path"})
in
  #"Removed Columns"

As before you need to pass a query created by the Power Query diagnostics functionality to the function:


[Note: I recommend calling the function on the “detailed” diagnostics query and not the summarised view, because I’ve found what might be a bug with the summarised view]

The function will then return a table with one row for each query that is evaluated while the diagnostics were running:


[Note: You may find that some query evaluations have nulls instead of the name of the query; that’s an issue with the diagnostics data that will be addressed soon]

You could get the same thing simply by filtering the raw diagnostics data down to the rows where the Path column contains 0, but I’ve added two extra things that make the data much easier to understand.

First of all, the original data contains columns called “Exclusive Duration” and “Exclusive Duration (%)”, which tell you the amount of time each operation took in seconds and what percentage this was of the entire query evaluation. However, the data itself is in a parent/child structure so while a parent operation may have a very small duration it may have many slow descendant operations. Therefore my function adds two new columns called “Exclusive Duration (Including Child Operations)” and “Exclusive Duration (%) (Including Child Operations)”. The first of these new columns returns the total duration of each operation and all its child operations, while the second returns the percentage that the duration of each operation and all its child operations make up of the whole query duration. At the root level, this means the first of these new columns shows the total duration of each query in seconds and the second of these new columns should always show 100%.


Secondly, the function adds a new column called Child Operations that contains a nested table containing the child operations associated with each operation:


You can either click on the Table link in this column to navigate to a new table containing all the child operations, or use the Expand/Aggregate button in the top-right hand corner to keep the current operations and join the nested rows onto them (I strongly recommend you do the former). Experienced Power Query developers will be familiar with the concept of nested tables and I think this provides an easy way of navigating the parent/child operation structure that is similar to what you see with a Decomposition Tree but keeps all the detail columns that the diagnostics data provides you with and means you don’t have to leave the Power Query Editor.

I would love to hear your feedback about this and know if there are any bugs or other information that could be added. You can download a sample pbix file for this function here.

 

Why Does Power BI Query My Data Source More Than Once?

This week I was honoured to be a guest on the Guy In A Cube channel, and for my topic I decided to tackle one of life’s eternal questions: why does Power BI query my data source more than once when I refresh my dataset?

You can watch the video here:

Although I’ve never answered this question directly in a blog post before, almost every technique I showed in the video is something I’ve blogged about, so I thought it would be useful to collect all the links to these posts in one place to provide some background to what I show in the video.

The first thing to say is that this is really a Power Query question, not just a Power BI question; a lot of what I show will therefore be relevant to Excel’s Get&Transform, Power BI dataflows and every other manifestation of Power Query out there, although I can’t guarantee that all the options and behaviour will be the same in these places.

If you’re developing in Power BI Desktop and you think that refresh is taking a long time, you should definitely check whether the Power Query engine is hitting your data source more than once. There are lots of ways to do this. Some data sources have tools that show when they are queried, such as the Run History screen in Microsoft Flow that I show in the video or SQL Server Profiler. Other ways include using Fiddler for web services or Process Monitor for files.

Next, if you find that Power Query is indeed querying your data source multiple times, the most common explanation is that it’s doing a “background refresh”. This happens by default when you refresh a dataset and is Power Query refreshing all the data previews that you see when you click on a step in the Power Query Editor window. In most cases this is so quick you don’t notice it but when you have lots of queries with lots of steps it can be a gigantic problem. You can turn it off in the Options dialog in Power BI Desktop, as shown in the video, or in the Excel Query Options dialog. I blogged about this issue here.

Another possible reason is the data privacy settings you have applied. This is a gigantic, complex topic and something that I and other people have blogged about many times. I wrote a multi-part series of posts explaining data privacy settings starting here and there’s also a great explanation of what’s going on behind the scenes written by the dev team that I link to here; I also wrote a post here showing an example of the performance impact of data privacy checks. As I say in the post, however, do not play around with these settings unless you really know what you’re doing.

If you’re ok with writing some M code, using the technique I blogged about here to implement basic query folding on a web data source can be important – in the example in the video, it was doing this that reduced the number of calls to the web service from six to three.

Some transformations can also lead to data being read from a data source more than once. For example, in this post I showed how a merge transformation leads to multiple reads from the same Excel file and how to stop this happening and drastically improve performance.

Moving on, another important lesson to learn is how referenced queries are evaluated. This is one of the most counter-intuitive things about Power Query! With the set of referenced queries used in my demo:

…where three queries called Referenced Query 1, Referenced Query 2 and Referenced Query 3 each reference a query called Call Web Service and are loaded into a dataset (and where Call Web Service is not loaded into the dataset), most people would assume that when a refresh takes place the following happens:

  1. The Call Web Service query is run, getting the data from the web service
  2. The data returned is then passed to Referenced Query 1, Referenced Query 2 and Referenced Query 3

This is wrong. In fact what happens is this:

  1. Referenced Query 1 is refreshed. Because it references the Call Web Service query, that means the Call Web Service query is also evaluated.
  2. Referenced Query 2 is refreshed. Because it references the Call Web Service query, that means the Call Web Service query is also evaluated.
  3. Referenced Query 3 is refreshed. Because it references the Call Web Service query, that means the Call Web Service query is also evaluated.

This explains why the Call Web Service query is still being evaluated three times once the M code above has been added to it. You can read a detailed explanation of how referenced queries are evaluated here.

There is something we can do to help here though: the Power Query engine can cache the results returned from some types of data source, such as web services, but to take full advantage of this caching in this particular scenario you have to turn off the Enable Parallel Loading Of Tables option on the Options dialog as I show here. With that done, each of the three referenced queries runs one after the other, instead of in parallel. When the first of these queries runs it calls the Call Web Service query, which gets data from the web service; when the next two queries run, though, and they trigger two more evaluations of Call Web Service, the data for the call to the web service has been cached. This caching can be turned off using the IsRetry option in Web.Contents as Curt Hagenlocher explains in a comment here.
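For illustration, here’s a sketch of what that option looks like in a Web.Contents call (the URL is a placeholder, not one from the video):

let
  // IsRetry = true tells the Power Query engine to ignore any previously cached
  // response for this request
  Source = Web.Contents("https://webservice.example.com/data", [IsRetry = true])
in
  Source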

The final thing shown in the video is that the infamous Table.Buffer M function does not improve performance in this case and in fact makes performance worse. If you use Table.Buffer inside the Call Web Service query it will indeed buffer the result of the call to the web service into memory, but the data that is buffered cannot be shared between separate query executions – so the buffering happens three times, once for each time Call Web Service is evaluated by the three referenced queries. Again this is something discussed in this post. Of course there are plenty of other scenarios where Table.Buffer will help performance, but these will only be in cases where the same query requests data from the same data source more than once.

Phew, that was a lot. I don’t pretend to know everything about when and why Power Query gets data from a data source multiple times but this is the sum total of my knowledge right now. I hope it’s useful!

[Update: it’s just been pointed out to me, quite rightly, that the names of my queries are confusing. For example, “Referenced Query 1” is the query doing the referencing, not the query being referenced – which is Call Web Service. I haven’t changed the names because I want the blog post to remain consistent with the video, and I hope the diagrams and the context make everything clear]
