Limit The Amount Of Data You Work With In Power BI Desktop Using Parameters And Deployment Pipelines

If you’re working with large amounts of data in Power BI you may find that you have problems because:

  • Your pbix file is very large
  • You spend a long time waiting for refreshes to finish in Power BI Desktop – and if you’re developing, you may need to refresh your dataset frequently
  • It takes a long time to publish your dataset to the Power BI Service

Wouldn’t it be great if there was a way to work with a small subset of your data in Power BI Desktop and then, after you publish, load all the data when you refresh? The good news is that this is now possible with the new deployment pipelines feature in Power BI!

Assuming that you know the basics of how deployment pipelines work (the documentation is very detailed), here’s a simple example of how to do this. Let’s say that you want to use data from the FactInternetSales table in the Adventure Works DW 2017 SQL Server sample database in your dataset. When you import the data from this table and open the Advanced Editor to look at the M code for the query, here’s what you’ll see:

let
  Source = Sql.Databases("MyServerName"),
  AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data],
  dbo_FactInternetSales = 
   AdventureWorksDW2017{[Schema = "dbo", Item = "FactInternetSales"]}[Data]
in
  dbo_FactInternetSales

This query, of course, imports all the data from this table. To cut it down to a smaller size, the first thing to do is to create a new Power Query parameter (called FilterRows here) of data type Decimal Number:

Parameter
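
Behind the scenes a parameter created through the Manage Parameters dialog is just another query. The M that gets generated for it looks something like this (a sketch of the typical generated code; the exact metadata fields may vary):

// Rough equivalent of the FilterRows parameter: a required Decimal Number
// parameter whose Current Value is 5
5 meta [
  IsParameterQuery = true, 
  Type = "Number", 
  IsParameterQueryRequired = true
]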

Notice that the Current Value property is set to 5. The purpose of this parameter is to control the number of rows from FactInternetSales that are loaded into the dataset. Here’s an updated version of the Power Query query above that uses this parameter:

let
  Source = Sql.Databases("MyServerName"),
  AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data],
  dbo_FactInternetSales = 
  AdventureWorksDW2017{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
  FilterLogic = 
   if 
     FilterRows <= 0 
    then 
     dbo_FactInternetSales 
    else 
     Table.FirstN(
      dbo_FactInternetSales, 
      FilterRows
     )
in
  FilterLogic

A new step called FilterLogic has been added at the end of this query that implements the following logic:

  • If the FilterRows parameter is less than or equal to 0 then return all the rows in the FactInternetSales table, otherwise
  • If FilterRows is more than 0 then return that number of rows from the table

Given that the FilterRows parameter is set to 5, this means that the query now returns only the top 5 rows from FactInternetSales:

Top5

It’s important to point out that a filter like this will only make your refreshes faster if the Power Query engine is able to apply the filter without reading all the data in the table itself. In this case it can: with a SQL Server data source, query folding ensures that the SQL query generated for the refresh only returns the top 5 rows from the FactInternetSales table:

SQL

Here’s a simple report with a card that shows the number of rows loaded into the table:

Top5Report

At this point you have your cut-down dataset for development in Power BI Desktop.

Next, publish this dataset and report to a workspace that is assigned to the Development slot in a deployment pipeline and then deploy them to the Test workspace:

Pipeline

Then click the button highlighted in the screenshot above to create a new dataset rule that changes the value of the FilterRows parameter to 0 when the dataset is deployed to the Test workspace:

ParameterRule

With this rule in place, when the dataset in the Test workspace is refreshed, the logic in the query above now ensures that all the data from the FactInternetSales table is loaded into the dataset. Instead of just 5 rows, the report now shows that the full 60000 rows of data have been loaded:

FullDataReport

 

Monitoring Power Query Memory Usage With Query Diagnostics In Power BI

In the April release of Power BI Desktop the Power Query Query Diagnostics feature was enhanced so that you can now return performance counter data. As the blog post says:

When you run performance counters, every half second Power Query will take a snapshot of resource utilization. This isn’t useful for very fast queries but can be helpful for queries that use up a lot more resources.

When might this be useful in the real world? In my last blog post I had a chart, created from data collected in Process Monitor using this technique, that shows the amount of data Power Query reads from disk while loading a large-ish JSON file. Here it is again:

SecondLoad

The x axis is relative time in seconds from when Power Query started reading the data; the y axis shows the amount of data read. Notice how data is read at a constant rate for the first 1.5 seconds, but that after the 1.5 second mark the throughput flattens? What could be causing this?

Although this does not happen consistently, and I didn’t collect the necessary data when I ran this particular test, the answer is likely to be related to how the Power Query engine uses memory.

This is a topic I have blogged about before and I strongly recommend that before carrying on you read this post on the Container Size property that can be set on dataflows in Power BI Premium. Here’s a quote from Curt Hagenlocher of the Power Query dev team from that post that is relevant here too, about how the Power Query engine uses memory when a query executes:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.
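
For context, Table.Buffer is simply wrapped around the step whose output needs to be kept in memory – here’s a minimal sketch, reusing the Adventure Works source from the first section of this post:

let
  Source = Sql.Databases("MyServerName"),
  AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data],
  dbo_FactInternetSales = 
   AdventureWorksDW2017{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
  // Keep the table in memory so that second and subsequent enumerations
  // do not go back to the source - this counts towards the container's RAM cap
  BufferedSales = Table.Buffer(dbo_FactInternetSales)
in
  BufferedSales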

Now it turns out that when Power Query reads a JSON file it has to load the whole file into memory – something that isn’t true for other types of data source. So maybe, while reading the JSON file used in my test, the slow-down in throughput was caused by paging?

With the option to collect performance counters turned on (see the announcement blog post for how to do this) I opened up the pbix file containing the query from my last blog post, went to the Power Query Editor, and collected some query diagnostics data by right-clicking on the last step in my query and selecting Diagnose:

Diagnose Step

It soon became clear that the Power Query engine has a pool of mashup containers that it reuses (yes, go and read that blog post I told you to read!) and it seems like they often need to do some garbage collection before a query runs. So, to get a nice-looking graph, I did something highly unsupported that still seemed to work: I opened Task Manager and killed all the Microsoft Mashup Evaluation Container processes I could see under Power BI. Having done this, when I collected my performance counter data I could build the following graph showing Power Query memory usage while the query was evaluating:

Memory

The x axis shows the number of seconds elapsed since the start of the query; the y axis shows the value in bytes for the Commit and Working Set performance counters. The yellow line, for the Commit (bytes) performance counter, shows the amount of virtual memory used by Power Query. The blue line, for the Working Set (bytes) performance counter, shows the amount of physical memory used by Power Query; as you can see it reaches 256MB (indicated by the red dotted line) halfway through and never exceeds that. While Commit is greater than Working Set, paging must be happening and Power Query performance may suffer as a result.

Over the years I have found that one of the most effective ways of improving Power Query performance is to try to avoid transformations like those that Curt mentions above that require large tables to be held in memory – although that’s easier said than done. At least now we have an easy way of seeing where memory might be causing problems for Power Query performance using this new feature.

Speed Up Data Refresh Performance In Power BI Desktop Using Table.View

It can sometimes be frustrating to work with slow data sources or complex Power Query queries in Power BI Desktop: you open the Power Query Editor, make some changes, click Close & Apply and then wait a loooong time for your data to refresh. In this post I’m going to show you a technique that can cut this wait by up to 50%. It involves some fairly complex M code but I promise you, the effort is worth it!

In fact, what I’m going to describe is more or less what I showed towards the end of my appearance on Guy In A Cube last year and in a few other posts, but at that time I didn’t understand properly why it worked or what the performance implications actually were. Now, thanks to a lot of help from Curt Hagenlocher of the Power Query development team I have all the details I need to blog about it.

Let’s see a simple example. Say you have a large JSON file – for this test I generated one that is 67MB containing random data using this handy online tool – and you want to load it into Power BI. You’ll end up with a query that looks something like this:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  )
in
  #"Changed Type"

Here’s what the output of this query looks like:

Output

On my laptop this query takes around 7-8 seconds to load. Using Process Monitor to see how much data is being read from the file (see this post for more details) shows that Power Query is reading the data from this file twice. This graph, generated using data from Process Monitor, has time on the x axis and amount of data read from the file on the y axis, and the two peaks indicate that the data is being read twice:

FirstLoad

Why? When you refresh a table in Power BI Desktop two things happen:

  • First, Power BI has to check what columns are present in the table. To do this, it runs the query for the table but asks the Power Query engine to filter the table so it returns zero rows.
  • Secondly, the query is run again but this time with no filter, so all rows are returned, and this is when the data is actually loaded into the table in Power BI.

It’s on this first run of the query where problems can occur. If you’re using a data source like SQL Server and you’re not doing any complex transformations then query folding will take place, so Power BI’s request to return the table but with no rows can be handled very efficiently. However, if you’re using a data source where query folding is not possible (such as Excel, CSV or JSON files) or you have complex transformations that stop folding taking place then the only way Power Query can work out what columns the query returns is by running the entire query. That’s what is happening in the example above.

This only happens in Power BI Desktop, though. When you refresh a dataset in the Power BI Service the query is only run once.

The solution is to trick the Power Query engine into running these zero-row filter queries instantly, and you can do that using the Table.View M function. I blogged about Table.View here (and I strongly suggest you read that post before carrying on) but that example didn’t quite get to the solution you need here. Here’s a new version of the query above with an extra step:

let
  Source = Json.Document(File.Contents("C:\generated.json")),
  #"Converted to Table" = Table.FromList(
    Source, 
    Splitter.SplitByNothing(), 
    null, 
    null, 
    ExtraValues.Error
  ),
  #"Expanded Column1" = Table.ExpandRecordColumn(
    #"Converted to Table", 
    "Column1", 
    {"_id", "index", "guid", "isActive", "balance"}, 
    {"_id", "index", "guid", "isActive", "balance"}
  ),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Expanded Column1", 
    {
      {"_id", type text}, 
      {"index", Int64.Type}, 
      {"guid", type text}, 
      {"isActive", type text}, 
      {"balance", type text}
    }
  ),
  OverrideZeroRowFilter = Table.View(
    null, 
    [
      GetType = () => 
        type table[
          _id = Text.Type, 
          index = Int64.Type, 
          guid = Text.Type, 
          isActive = Text.Type, 
          balance = Text.Type
        ], 
      GetRows = () => 
        #"Changed Type", 
      OnTake = (count as number) => 
        if count = 0 then 
          #table(
            type table[
              _id = Text.Type, 
              index = Int64.Type, 
              guid = Text.Type, 
              isActive = Text.Type, 
              balance = Text.Type
            ], 
            {}
          ) 
        else 
          Table.FirstN(#"Changed Type", count)
    ]
  )
in
  OverrideZeroRowFilter

The OverrideZeroRowFilter step is where the magic happens. It uses Table.View to override query folding behaviour by intercepting what happens when the table returned by the query is filtered. There are three fields in the record in the second parameter of Table.View that you need to change:

  • GetType returns a table type that describes the columns and their data types present in the output of the query. The five columns listed here are the five columns you can see in the screenshot of the query output above. It’s very easy to generate the required M code when you use the custom function that I blogged about here (see the sketch after this list).
  • GetRows returns all the rows that the query can return, that’s to say the table returned by the #”Changed Type” step.
  • OnTake is used when a top n filter is applied to the table returned by the query.  In this case the code looks at the number of rows that are being requested (specified in the count parameter) and if that’s zero, it just returns an empty table with the same columns as the #”Changed Type” step; if it’s more than zero then it uses Table.FirstN to get the actual number of rows requested from #”Changed Type”. This means that when Power BI does that zero-row filter it can now happen immediately because there’s no need to go back to the data source or execute any of the transformations in the query.
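
Something along these lines will generate the text of a table type from any table using Table.Schema (a rough sketch, not the exact custom function linked to above):

// Build the text of an M table type (e.g. "type table[_id = Text.Type, ...]")
// from any table, ready to paste into the GetType and OnTake definitions
(InputTable as table) as text =>
let
  Schema = Table.Schema(InputTable),
  ColumnDefinitions = 
    Table.TransformRows(Schema, each [Name] & " = " & [TypeName]),
  TypeText = 
    "type table[" & Text.Combine(ColumnDefinitions, ", ") & "]"
in
  TypeText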

This version of the query now runs in 4-5 seconds, and Process Monitor now shows that the JSON file is only read once and, obviously, reading the data once is a lot faster than reading it twice:

SecondLoad

Look at how the graph flattens at the end… but that’s something for another blog post.

There is a downside to this approach: you have to hard-code the table schema that you expect your query to return, and if you change your query to return different columns you’ll have to update the table type in the last step.

I’ve used this technique on other slow queries and it has made a massive difference to the development experience in Power BI Desktop – one query that was taking five minutes to refresh when I closed the Power Query Editor went down to two and a half minutes. If you try this out yourself please let me know if it makes a difference by leaving a comment.

[Don’t forget that there are other things you can do that will also speed up the development experience for slow queries: you should definitely turn off the “Allow data preview to download in the background” option, and you might want to consider turning off data privacy checks, so long as you fully understand the implications]

 

Power Query Formatter

If you’ve ever used DAX Formatter to format your DAX code and wondered why there isn’t an equivalent for Power Query/M, then wonder no more: the nice people at Power Pivot Insights in Germany (read their German-language blog here) have built one. You can find it here:

https://powerqueryformatter.com/

PQ Formatter

They also have an API you can use.

Bonus information: did you know that Microsoft also has an open source parser for M, available here?

Keep The Existing Data In Your Power BI Dataset And Add New Data To It Using Incremental Refresh

Power BI incremental refresh is a very powerful feature and now that it’s available in Shared capacity (not just Premium) everyone can use it. It’s designed for scenarios where you have a data warehouse running on a relational database, but with a little thought you can make it do all kinds of other interesting things; Miguel Escobar’s recent blog post on how to use incremental refresh for files in a folder is a great example of this. In this post I’m going to show you how to use incremental refresh to solve another very common problem – namely how to get Power BI to keep the data that’s already in your dataset and add new data to it.

I know what you’re thinking at this point: isn’t this what incremental refresh is meant to do anyway? Well, yes it is, but as I said it’s designed to work in scenarios where a relational data warehouse stores a copy of all the data that’s in your dataset and in some cases you don’t have that luxury. For example, you may want to:

  • Connect to a data source that only gives you a set of new sales transactions each day, and add these sales transactions to the ones you have already stored in a Power BI dataset
  • Take a snapshot of a data source, like an Excel workbook, that is changing all the time and store each of these daily snapshots of the contents of the workbook in a Power BI dataset

The current Power BI incremental refresh functionality doesn’t make it easy to do either of these things, and that’s why I’ve written this post.

Let me be clear though: in all these cases I recommend that you don’t use the technique I’m going to show you. If possible, you should stage a copy of each day’s data in a relational database (ie build that data warehouse) or even as text files in a folder (Power Automate may be useful to do this) and use that staged copy as the data source for Power BI. This will allow you to do a full refresh of the data in your dataset at any point in the future if you need to, or create a completely new dataset, even though it means you have to do a lot of extra work. If it isn’t possible to do this, or you’re too lazy or you’re just curious to see how my technique works, read on.

For the example I’m going to show in this post I’m going to use a web-based data source, an RSS feed from the BBC News website that returns a list of the current top stories on the site. You can find it here:

http://feeds.bbci.co.uk/news/rss.xml

RSS is based on XML and there’s no authentication needed to access this feed, so loading a table of these top stories into Power BI is very easy – so easy, I’m not going to bother explaining how to do it. Instead I’m going to show you how to use incremental refresh to store a copy of this list of top stories every day in the same table in a dataset.

First of all, for incremental refresh to work, two Power Query parameters of data type date time called RangeStart and RangeEnd need to be created. More details about how to create them can be found in the docs here. Power BI expects them to be used to filter the rows loaded into a table; it also uses them to partition these tables in the dataset.

Next you need a Power Query query to load the news stories. Here’s the M code for the query:
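
[Note: the version below is a sketch – the Xml.Tables navigation and column handling are assumptions about the structure of the feed, but the step names match the ones described afterwards]

let
  // Connect to the RSS feed and navigate to the list of news items
  Source = Xml.Tables(Web.Contents("http://feeds.bbci.co.uk/news/rss.xml")),
  Channel = Source{0}[channel],
  Items = Channel{0}[item],
  // Get the current UTC date and time; FixedUtcNow returns the same value
  // every time it is called within a single query evaluation
  CurrentDateTime = DateTimeZone.FixedUtcNow(),
  // Extract the date from CurrentDateTime and subtract one day
  PreviousDay = Date.AddDays(Date.From(CurrentDateTime), -1),
  // Record when each row was loaded
  #"Added Custom" = 
    Table.AddColumn(Items, "UTC Data Load Date", each CurrentDateTime),
  // Return rows only when RangeStart holds yesterday's date
  // (RangeEnd should never be null)
  #"Filtered Rows" = 
    Table.SelectRows(
      #"Added Custom", 
      each Date.From(RangeStart) = PreviousDay and RangeEnd <> null
    )
in
  #"Filtered Rows"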

As I said, the first couple of steps aren’t interesting – they just connect to the RSS feed to get the list of top stories as a table. After that:

  • The CurrentDateTime step gets the current date and time at the point when the dataset refreshes. It’s important to note that I’ve used the DateTimeZone.FixedUtcNow function – this not only gives you the current UTC date and time, but it’s guaranteed to give you the same date and time every time you call it within a query. If you use DateTimeZone.UtcNow you may get a different date time returned every time the function is called within the query, which can make things very confusing.
  • The PreviousDay step gets yesterday’s date by extracting the date from the value found in the CurrentDateTime step and subtracting one day.
  • The #”Added Custom” step adds a new column to the table called “UTC Data Load Date” containing the value returned by the CurrentDateTime step. As the name suggests, this column shows when the data in any particular row was loaded.
  • The #”Filtered Rows” step is extremely important. It filters the rows in the table: it will return all the rows in the table if the RangeStart parameter returns yesterday’s date, otherwise it will return no rows at all. It also has an AND condition that checks whether the RangeEnd parameter is not null, which should always be true. Without this step that uses RangeStart and RangeEnd in some way you won’t be able to turn on incremental refresh; the significance of only returning data for yesterday’s date will become clear later.

At this point, unless you happen to have set the value of the RangeStart parameter to yesterday’s date, the table returned by the query will be empty.

Once you have loaded this table to your Power BI dataset (I called it “News”) you need to configure incremental refresh on it. You can do this by right-clicking on the table in the Fields pane in the main Power BI window and selecting Incremental refresh from the menu. Here’s how you will need to configure the settings:

IncrementalConfig

You can ignore the warning about query folding and the M query at the top. What you need to do here is:

  • Set the “Incremental refresh” slider to On
  • Under “Store rows in the last” choose Days from the dropdown and enter the number of days you want to store data for. You probably don’t need to store data indefinitely; if you did, your dataset might get very large. Data that is older than the number of days specified here will be deleted from the dataset.
  • Under “Refresh rows in the last” again select Days from the dropdown and enter 1.
  • Do not check the boxes for “Detect data changes” and “Only refresh complete day”.

With that done all you need to do is publish to the Power BI Service and set up scheduled refresh for once a day (and no more). When the report refreshes each day, all the stories listed in the RSS feed will be added to the existing stories in the dataset. To monitor this I created a measure called Story Count to count the number of rows in the News table, and then created a simple report that showed the total value of this measure and also showed it broken down by the UTC Data Load Date column. Here’s what it looked like yesterday (the 12th of April 2020), after it had already been refreshed for a few days:

ReportYesterday

and here’s what it looked like today (the 13th of April 2020), with 51 more stories having been loaded in this morning:

ReportToday

Job done!

The last thing to point out is that connecting to the XMLA endpoint (or should I say connecting via the “Analysis Services protocol”…?) for the workspace using SQL Server Management Studio, if it’s in Premium, makes it a lot easier to understand what’s happening behind the scenes here. If you right-click on your table in SQL Server Management Studio’s Object Explorer pane and select Partitions:

PartitionsSSMS

…You can see the names of the 51 partitions created to hold the data (remember, above we opted to store 50 days of historical data), the number of rows of data in each partition and the dates that these partitions were last refreshed (or “processed” in Analysis Services terminology):

PartitionRefresh

Notice that the partition for today’s date, April 13th, is empty and that both the partition for today and yesterday (April 12th) have been refreshed today; similarly, the partition for April 11th was last refreshed yesterday on April 12th. The two most recent partitions are always refreshed and this is why the #”Filtered Rows” step in the M code above only returns rows when the RangeStart parameter holds yesterday’s date, to make sure that the stories for the current day are only stored once.

You can also script out the table to TMSL like so:

PartitionsScript

This allows you to see the definitions of each partition and, crucially, the start and end values that are passed to the RangeStart and RangeEnd parameters when they are refreshed:

TOM

You can download my example pbix file for this post here.

Understanding The “The key didn’t match any rows in the table” Error In Power Query In Power BI Or Excel

One of the most common errors you’ll see when working with Power Query in Power BI or Excel is this:

Expression.Error: The key didn’t match any rows in the table

KeyErrorScreenshot

It can occur with almost any data source and in a wide variety of different circumstances, and for new users of Power Query it can be very confusing. In this post I’ll explain what the error message means and when you’re likely to encounter it using a simple example.

TL;DR You’re probably getting this error because your Power Query query is trying to connect to a table or worksheet or something in your data source that has been deleted or been renamed.

Say you have an Excel workbook with three worksheets called Sheet1, Sheet2 and Sheet3:

WorksheetsBefore

On Sheet1 there’s some data you want to load into Power BI or Excel using Power Query:

Data

It’s very easy to do this: after you click on the Excel data source and select your Excel workbook, you see the Navigator window showing the contents of the workbook, where you select Sheet1:

Nav

…click OK and your query is ready:

PQQuery

What actually happens in this query though? If you click on each of the four steps in the Applied Steps pane on the right-hand side of the screen, you’ll see what each step does. In particular, notice that the first step in the query (called Source) returns a table with one row for each worksheet, table and named range in the workbook:

SourceStep

This is something that happens with many types of data source: after you have connected you have to choose what data inside the data source you want to connect to in the first step, and the list of things you can connect to is returned in a table. This table lists the same things displayed in the Navigator dialog shown above. In the same way, if you connect to a SQL Server database this first step returns a table with all the tables and views in the database; if you connect to an OData feed the first step shows all the resources available through the feed.

The other thing to notice in the table in the screenshot is the Data column, which contains nested table values. If you were to click on one of these nested tables you would navigate to the data in the worksheet named on that row and that is in fact what the second step in the query does.

Here’s the M code for the whole query:


let
  Source = 
    Excel.Workbook(
      File.Contents("C:\MyExcelFile.xlsx"), 
      null, 
      true
    ),
  Sheet1_Sheet = 
    Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
  #"Promoted Headers" = 
    Table.PromoteHeaders(
      Sheet1_Sheet, 
      [PromoteAllScalars = true]
    ),
  #"Changed Type" = 
    Table.TransformColumnTypes(
      #"Promoted Headers", 
      {{"Month", type text}, {"Sales", Int64.Type}}
    )
in
  #"Changed Type"

If you want to understand how the code in the second step of this query gets the contents of the cell containing the nested table, I have a blog post here that goes into a lot more detail. Basically the second step (called Sheet1_Sheet in the code, but shown as Navigation in the Applied Steps pane) returns the nested table in the Data column from the row in the table where the Item column contains the value “Sheet1” and the Kind column contains the value “Sheet”. It’s the contents of the Item and Kind columns that are used to identify the row in the table that contains the data you want to see – in database terminology these columns are the key columns on the table.

What happens if you change the name of the worksheet that contains the data you want, from “Sheet1” to “Hello”?

WorksheetsAfter

The result is that the first step of the query now returns a table that looks like this:

AfterNav

Since there isn’t a row where the Item column contains “Sheet1” any more, the second step can no longer find the row it’s looking for – the key value it’s looking for no longer exists in the table – which is why you see the error message you do:

KeyErrorScreenshot

There are two ways to fix the error:

  1. Go to the Excel workbook and change the name of the sheet called “Hello” back to “Sheet1” or
  2. In either the Advanced Editor or the Formula Bar edit the M code and replace the reference to “Sheet1” with “Hello” in the second step of the query:
    FormulaBar
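
With the second approach, the Navigation step in the query above ends up looking like this:

  Sheet1_Sheet = 
    // was: Source{[Item = "Sheet1", Kind = "Sheet"]}[Data],
    Source{[Item = "Hello", Kind = "Sheet"]}[Data],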

In summary, when you’re working with data sources in Power Query you need to be aware that if something in your data source is renamed or deleted then your query will break. You can of course write some clever M code to deal with situations like this but my recommendation is to try to fix the problem in the data source and not in your query.

Adding Your Own Messages To Power Query Query Diagnostics

A quick point: while the Power Query Query Diagnostics functionality is relatively new, it’s based on Power BI/Power Query trace logging that has been around for a while. I’ve just realised that this means you can use the Diagnostics.Trace M function that I blogged about back in 2016 to add your own messages to the output of Query Diagnostics. Using the example query from that blog post, here’s what you’ll see in the detailed Query Diagnostics output query when that query is run (I’ve removed all but the important columns to make it easier to read):

TraceOutput
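
As a reminder, a call to Diagnostics.Trace looks something like this (a minimal sketch rather than the exact query from that 2016 post):

let
  Source = "Hello World", 
  Output = 
    Diagnostics.Trace(
      TraceLevel.Information, 
      "My custom trace message", 
      () => Source, 
      // delayed = true: the value above is a function that is only
      // evaluated when the traced value is actually needed
      true
    )
in
  Output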

The amount of detail you get with Query Diagnostics can be overwhelming; inserting your own messages should make it a lot easier to work out what is happening, when and how often – especially in scenarios where the query name and step name aren’t shown.

An M Function To Help You Explore Power Query Diagnostics Data

Last week’s post showed an M function that took Power Query diagnostics data and formatted it in a way that made it suitable for visualisation in a Power BI Decomposition Tree visual. This is great for understanding what’s going on at a high level, but by doing this you also lose a lot of detailed information from the diagnostics logs that could be useful for performance tuning. This week I have an M function for you that takes a different approach to solving the same problem that might be more appealing for the hardcore Power Query fans out there.

Here’s the code for the function:
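
[Note: what follows is a much-simplified sketch of the core idea rather than the full function – the column names Id, Path and Exclusive Duration, and the use of “/” as the path separator, are assumptions about the shape of the diagnostics output, and the percentage column is omitted]

(Diagnostics as table) as table =>
let
  // For each operation, sum Exclusive Duration over the operation itself
  // and all of its descendants
  AddInclusiveDuration = 
    Table.AddColumn(
      Diagnostics, 
      "Exclusive Duration (Including Child Operations)", 
      (parent) => 
        List.Sum(
          Table.SelectRows(
            Diagnostics, 
            (child) => 
              child[Id] = parent[Id] and 
              (child[Path] = parent[Path] or 
               Text.StartsWith(child[Path], parent[Path] & "/"))
          )[#"Exclusive Duration"]
        )
    ),
  // Add a nested table containing each operation's immediate children
  AddChildRows = 
    Table.AddColumn(
      AddInclusiveDuration, 
      "Child Rows", 
      (parent) => 
        Table.SelectRows(
          AddInclusiveDuration, 
          (child) => 
            child[Id] = parent[Id] and 
            Text.StartsWith(child[Path], parent[Path] & "/") and 
            List.Count(Text.Split(child[Path], "/")) = 
              List.Count(Text.Split(parent[Path], "/")) + 1
        )
    ),
  // One row per query evaluation: the root operations
  RootRows = Table.SelectRows(AddChildRows, each [Path] = "0")
in
  RootRows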

As before you need to pass a query created by the Power Query diagnostics functionality to the function:

InvokeFunction

[Note: I recommend calling the function on the “detailed” diagnostics query and not the summarised view, because I’ve found what might be a bug with the summarised view]

The function will then return a table with one row for each query that is evaluated while the diagnostics were running:

QueryList

[Note: You may find that some query evaluations have nulls instead of the name of the query; that’s an issue with the diagnostics data that will be addressed soon]

You could get the same thing simply by filtering the raw diagnostics data down to the rows where the Path column contains 0, but I’ve added two extra things that make the data much easier to understand.

First of all, the original data contains columns called “Exclusive Duration” and “Exclusive Duration (%)”, which tell you the amount of time each operation took in seconds and what percentage this was of the entire query evaluation. However, the data itself is in a parent/child structure so while a parent operation may have a very small duration it may have many slow descendant operations. Therefore my function adds two new columns called “Exclusive Duration (Including Child Operations)” and “Exclusive Duration (%)(Including Child Operations)”. The first of these new columns returns the total duration of each operation and all its child operations, while the second returns the percentage that the duration of each operation and all its child operations make up of the whole query duration. At the root level, this means the first of these new columns shows the total duration of each query in seconds and the second of these new columns should always show 100%.

Durations

Secondly, the function adds a new column called Child Rows that contains a nested table containing the child operations associated with each operation:

ChildRows

You can either click on the Table link in this column to navigate to a new table containing all the child operations, or use the Expand/Aggregate button in the top-right hand corner to keep the current operations and join the nested rows onto them (I strongly recommend you do the former). Experienced Power Query developers will be familiar with the concept of nested tables and I think this provides an easy way of navigating the parent/child operation structure that is similar to what you see with a Decomposition Tree but keeps all the detail columns that the diagnostics data provides you with and means you don’t have to leave the Power Query Editor.

I would love to hear your feedback about this and know if there are any bugs or other information that could be added. You can download a sample pbix file for this function here.

 

Visualising Power Query Diagnostics Data In A Power BI Decomposition Tree

Recently I’ve been working with the Power Query team to come up with some ways to help developers understand the data returned by the new Power Query diagnostics functionality. In this, the first of two posts, I’m going to share a Power Query function that reformats diagnostics data in a way that makes it easy to visualise using the Power BI Decomposition Tree visual.

Here’s the code for the function:
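
[Note: the code below is a simplified sketch of the central reshaping step – it splits the Path column into one “Level.n” column per path segment, whereas the full function also fills those columns with the ancestor operation names and adds the (Blank) padding rows described later in this post; the “/” separator is an assumption about the diagnostics output]

(Diagnostics as table) as table =>
let
  // Find the deepest level of nesting in the Path column
  MaxDepth = 
    List.Max(
      List.Transform(
        Diagnostics[Path], 
        each List.Count(Text.Split(_, "/"))
      )
    ),
  // Split Path into Level.1, Level.2, ... Level.n columns
  SplitPath = 
    Table.SplitColumn(
      Diagnostics, 
      "Path", 
      Splitter.SplitTextByDelimiter("/"), 
      List.Transform({1..MaxDepth}, each "Level." & Text.From(_))
    )
in
  SplitPath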

Here’s how to use it:

  1. Copy the code above into a new blank query in the Power Query Editor window, and give the query a meaningful name like DiagnosticsForDecompTree
  2. Pass in the name of a query created by using the Query Diagnostics functionality (you’ll be able to choose any query in your pbix file using the dropdown box) and invoke the function
  3. Load the output of the new query that gets created by invoking the function into your Power BI dataset
  4. Add a new Decomposition Tree visual to a report page
  5. Drag the Exclusive Duration column into the Analyze well of the visual
  6. Drag all the columns whose names start with “Level” (ie “Level.1”, “Level.2” and so on) into the Explain By well of the visual
  7. Add a filter or slicer on the Query column to filter the data shown in the visual down to a single query evaluation.

The output will look something like this:

DecompTree

What does this function do? The original output of a diagnostics query has one row for each event; each event may have one or more child events, and the Path column shows the details of this parent/child structure. The function takes this structure and expands it out to multiple columns, so for each event you see all of its ancestor events in a separate column on the same row. Since each event’s duration only gives you the time taken for it to run and doesn’t include the time taken by its child events, when you break down an event in the Decomposition Tree you’ll see some child events with the name (Blank) – these give you the duration of just the parent event and make sure everything adds up properly. This function should work for both the detailed diagnostics output as well as the summary output.

You can download a sample pbix file showing an example of how to use it here.

This is only a first draft of the function, so I’d really appreciate it if you could try this with your own diagnostics data and check if it works properly and also if it produces useful output. Please leave your feedback in a comment either here or on the Github Gist. There are almost certainly going to be bugs!

Of course the real question is how you interpret what you see in the Decomposition Tree and how you can use it to improve the performance of your queries. That, my friends, is a journey we’re on together, but I think that being able to visualise diagnostics data easily is an important first step to being able to understand what’s going on inside the Power Query engine when a query is evaluated.

[Thanks to Jorge Gomez Basanta, Colin Popell and Matt Masson for their help with this code]

 

New “Diagnose Step” Power Query Feature In Power BI

One new feature that was introduced in the December 2019 release of Power BI Desktop, but which wasn’t mentioned in the release blog post, is the “Diagnose Step” button on the Tools tab of the Power Query Editor ribbon:

image

As you might have guessed, it’s closely related to the Query Diagnostics functionality that was introduced back in October. Whereas the existing Query Diagnostics functionality allows you to see what happens inside the Power Query engine when a query is executed, this new feature does something similar but allows you to run a query up to a specific step. This is useful in scenarios where you want to reduce the diagnostics data you are collecting to a subset of the steps in the query without having to comment out a lot of M code.

To illustrate this, consider the following super-simple query with three steps:

image
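
In M the query is just something like this (a sketch – the actual step contents don’t matter, only that there are three steps called Step1, Step2 and Step3):

let
  Step1 = 1,
  Step2 = Step1 + 1,
  Step3 = Step2 + 1
in
  Step3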

If you click the Start Diagnostics button on the Tools tab of the ribbon of the Power Query Editor, refresh the query from the main Power BI window (as I show here) and then click the Stop Diagnostics button, the query will be run, the result of Step3 will be loaded into the dataset and the diagnostics data collected will be for the whole query.

However, if you select Step2 in the Applied Steps pane:

image

…and then click the new Diagnose Step button on the Tools tab of the ribbon, the query will only be executed up to Step2 (meaning any steps that Step2 is dependent on will also be evaluated) and the diagnostics data collected will reflect that.
