Returning Annotations From OData Sources In Power BI/Power Query

The Power Query OData.Feed function has an option called IncludeAnnotations that allows you to return annotation values from an OData data source. It’s not obvious how to use it though – even if you use this option when connecting, you won’t see the annotation values by default because they are returned as metadata. Here’s an example of how to get annotation values using some simple M code.

The following query gets data from the statuscode column of the bookableresourcecategories table in Dynamics CRM via OData:


let
  Source =
    OData.Feed(
      "https://xyz.crm.dynamics.com/api/data/v9.1",
      null,
      [Implementation = "2.0"]
    ),
  bookableresourcecategories_table =
    Source{[
      Name = "bookableresourcecategories",
      Signature = "table"
    ]}[Data],
  #"Removed Other Columns" =
    Table.SelectColumns(
      bookableresourcecategories_table,
      {"statuscode"}
    )
in
  #"Removed Other Columns"


To get the option set labels associated with these values, you first of all need to edit the record in the third parameter of OData.Feed and use the IncludeAnnotations option to get the FormattedValue annotation like so:


OData.Feed(
  "https://xyz.crm.dynamics.com/api/data/v9.1",
  null,
  [
    Implementation = "2.0",
    IncludeAnnotations = "OData.Community.Display.V1.FormattedValue"
  ]
)

Then you need to add a custom column in the Power Query Editor that gets the metadata from each cell in the statuscode column using the Value.Metadata function:



Value.Metadata(
  [statuscode]
)[OData.Community.Display.V1.FormattedValue]?

By the way, if you’re wondering what the question mark does in this expression, it stops an error occurring if there is no OData.Community.Display.V1.FormattedValue field in the metadata record; this post has more details.
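
For example, here’s a minimal standalone sketch (not tied to any data source) showing how the ? operator returns null, rather than raising an error, when the field you ask for isn’t present:

let
  RecordWithField = [FormattedValue = "Active"],
  RecordWithoutField = [SomethingElse = 1],
  GetExisting = RecordWithField[FormattedValue]?,
  // Without the ? the next expression would raise an error because the field
  // doesn't exist; with it, the result is simply null
  GetMissing = RecordWithoutField[FormattedValue]?
in
  {GetExisting, GetMissing}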

The full query looks like this:


let
  Source =
    OData.Feed(
      "https://xyz.crm.dynamics.com/api/data/v9.1",
      null,
      [
        Implementation = "2.0",
        IncludeAnnotations = "OData.Community.Display.V1.FormattedValue"
      ]
    ),
  bookableresourcecategories_table =
    Source{[
      Name = "bookableresourcecategories",
      Signature = "table"
    ]}[Data],
  #"Removed Other Columns" =
    Table.SelectColumns(
      bookableresourcecategories_table,
      {"statuscode"}
    ),
  #"Added Custom" =
    Table.AddColumn(
      #"Removed Other Columns",
      "Custom",
      each Value.Metadata(
        [statuscode]
      )[OData.Community.Display.V1.FormattedValue]?
    )
in
  #"Added Custom"

…and here’s the output of the query showing the option set labels in a new column:


[Thanks again to Matt Masson for providing this information]

Adding Tooltips To Column Headers In The Power Query Editor In Power BI And Excel

Here’s something that will excite all you M nerds out there! Did you know you can make tooltips appear when you hover over column headers in a table in the Power Query Editor in either Power BI Desktop or Excel?

It’s all done with M metadata, and here’s an example query:


let
  source =
    #table({"firstname", "lastname"}, {{"matt", "masson"}}),
  tableType =
    type table[firstname = Text.Type, lastname = Text.Type]
    meta [
      Documentation.FieldDescription =
        [firstname = "Given Name", lastname = "Family Name"]
    ],
  replaceType = Value.ReplaceType(source, tableType)
in
  replaceType

Here’s what you see when your mouse hovers over the firstname column in the Power Query Editor:


…and here’s what you see when your mouse hovers over the lastname column:


How does this work? Here’s what each of the steps do:

  • The source step creates a simple table with two columns called firstname and lastname using #table (see here for more details on that).
  • The tableType step declares a new table type with two columns (the same two columns as in the table from the previous step) and then adds a metadata record to this type. In that record the Documentation.FieldDescription field contains the text values that will appear as tooltips when you hover over each column.
  • The replaceType step replaces the type of the table returned by the source step with the type declared in the tableType step.

The Power Query Editor UI then looks for a Documentation.FieldDescription field in any metadata associated with a table and displays the values in that field when you hover over the appropriate column.

Note that if you add any steps to your query after this that change the table type (for example, steps that add or remove columns), the metadata is removed ☹. That said, I still feel like this might be a useful feature for anyone building a custom connector, for example.
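
One possible workaround (a sketch based on the same Value.ReplaceType approach, not something the Power Query UI generates for you) is to re-declare the table type, including any new columns, as the very last step of the query:

let
  source = #table({"firstname", "lastname"}, {{"matt", "masson"}}),
  addUpper =
    Table.AddColumn(source, "upperlast", each Text.Upper([lastname]), Text.Type),
  // Re-declare the full table type, including the new column, after all transformations
  tableType =
    type table[firstname = Text.Type, lastname = Text.Type, upperlast = Text.Type]
    meta [
      Documentation.FieldDescription =
        [
          firstname = "Given Name",
          lastname = "Family Name",
          upperlast = "Family Name in upper case"
        ]
    ],
  replaceType = Value.ReplaceType(addUpper, tableType)
in
  replaceType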

[Thanks to Matt Masson for telling me how all this works]

 

 

Why Is Power BI Running My SQL Query Twice?

When you import data from a relational database like SQL Server in Power BI you have the option of entering your own SQL query to use as a starting point:


Here’s the M code for a query that does this:


let
  Source = Sql.Database(
    "localhost",
    "AdventureWorksDW2017",
    [Query = "SELECT [DateKey]#(lf) ,[FullDateAlternateKey]#(lf) ,[DayNumberOfWeek]#(lf) ,[EnglishDayNameOfWeek]#(lf) FROM [AdventureWorksDW2017].[dbo].[DimDate]"]
  )
in
  Source

If you’re confident writing SQL this might seem like a good option, but as I said in this blog post it has the side-effect of disabling query folding inside the Power Query query, so if you add any other transformations they will always be performed inside the Power Query engine – which may be less efficient than performing them in the data source.

There’s also another drawback: when you refresh your dataset in Power BI Desktop (although not in the Power BI Service) you’ll see that your SQL query is run twice. Here’s the evidence from SQL Server Profiler showing what happens when the query above is refreshed in Power BI Desktop:


If your query is slow, or if each query execution costs you money, then this is something you want to avoid.

Why is this happening? It turns out this is just another example of what I blogged about here: Power BI wants to know the schema of the table before the query actually runs, so it asks Power Query to return the top 0 rows. Unfortunately, in this case query folding can’t take place and the top 0 filter can’t be pushed back to the database, so the entire query gets run once to get the schema and once to get the data.

The solution is the same as in the blog post I just mentioned too: use the Table.View M function to hard-code the schema returned by the query and implement query folding manually. Here’s the adapted version of the query:


let
  Source = Sql.Database(
    "localhost",
    "AdventureWorksDW2017",
    [Query = "SELECT [DateKey]#(lf) ,[FullDateAlternateKey]#(lf) ,[DayNumberOfWeek]#(lf) ,[EnglishDayNameOfWeek]#(lf) FROM [AdventureWorksDW2017].[dbo].[DimDate]"]
  ),
  OverrideZeroRowFilter = Table.View(
    null,
    [
      GetType = () =>
        type table[
          DateKey = Int32.Type,
          FullDateAlternateKey = DateTime.Type,
          DayNumberOfWeek = Byte.Type,
          EnglishDayNameOfWeek = Text.Type
        ],
      GetRows = () => Source,
      OnTake = (count as number) =>
        if count = 0
        then #table(GetType(), {})
        else Table.FirstN(Source, count)
    ]
  )
in
  OverrideZeroRowFilter


Generally speaking, I think there’s a lot to be said for creating views (if possible) instead of embedding your own SQL into a Power BI dataset – it makes maintenance and tuning much easier, and of course if you can connect straight to the view without writing any SQL in Power BI, then query folding will work and Power BI Desktop will only query the view once when you refresh.

Video: Power BI Data Privacy Settings Deep Dive

Over the past year or so I’ve been delivering a presentation on the Power Query engine’s data privacy settings at various conferences and Power BI user groups, in an attempt to try to pull together all the knowledge I have on this complex topic. Luckily, when I presented this session at the London Power BI User Group recently, they recorded it and posted it on YouTube here:

If you’re struggling with data privacy errors like:

Formula.Firewall: Query ‘Query1’ (step ‘xyz’) is accessing data sources that have privacy levels which cannot be used together. Please rebuild this data combination.

or

Formula.Firewall: Query ‘Query1’ (step ‘xyz’) references other queries or steps, so it may not directly access a data source. Please rebuild this data combination.

…in either the Power Query Editor in Power BI or Power Query/Get&Transform in Excel, then I hope this video will help you understand why you’re getting these errors and what you can do to avoid them.

A lot of what I show in this session draws on other material, such as:

  • My five-part series of posts on the Power Query data privacy settings that starts here
  • My post on how credentials and data privacy settings are stored for dynamic data sources here
  • My post here on the performance overhead of applying data privacy checks
  • Ehren von Lehe’s detailed paper (available here) on how the engine partitions data sources while applying data privacy checks. One of the things I show in the video is that it’s now possible to see these partitions using Power Query Query Diagnostics (see here for some details – although I should probably devote a whole blog post to this in the future)

Lastly, one minor correction to something I said in the video: at the 44:32 mark I do a demo that shows how you can use M functions to avoid a Formula Firewall error. While this trick works in Power BI Desktop and Excel, it results in a dataset that can’t be refreshed in the Power BI Service unfortunately.

 

Optimising The Performance Of Power Query Merges In Power BI, Part 4: Table.Join And Other Join Algorithms

In the previous post in this series I showed how you can use the Table.Join function and the SortMerge algorithm to get much better performance for merge operations in Power Query – but only if your data is sorted. But what if your data isn’t sorted? Can those other join algorithms give you better performance? The answer is… you guessed it, it depends.

As a reminder, the seven join algorithms that can be used with Table.Join are:

  • JoinAlgorithm.Dynamic
  • JoinAlgorithm.LeftHash
  • JoinAlgorithm.LeftIndex
  • JoinAlgorithm.PairwiseHash
  • JoinAlgorithm.RightHash
  • JoinAlgorithm.RightIndex
  • JoinAlgorithm.SortMerge

The first thing to say is that if you don’t specify a join algorithm in the sixth parameter of Table.Join (it’s an optional parameter), Power Query will try to decide which algorithm to use based on some undocumented heuristics. The same thing also happens if you use JoinAlgorithm.Dynamic in the sixth parameter of Table.Join, or if you use the Table.NestedJoin function instead, which doesn’t allow you to explicitly specify an algorithm.

There are going to be some cases where you can get better performance by explicitly specifying a join algorithm instead of relying on JoinAlgorithm.Dynamic but you’ll have to do some thorough testing to prove it. From what I’ve seen there are lots of cases where explicitly setting the algorithm will result in worse performance, although there are enough cases where doing so results in better performance to make all that testing worthwhile.

For example, using the same CSV file that I’ve been using in my previous posts, I created two source queries called First and Second that only returned column A and the first 300,000 rows. Here’s the M code for First (the code for Second only differs in that it renames the only column to A2):


let
  Source = Csv.Document(
    File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
    [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Removed Other Columns" = Table.SelectColumns(#"Promoted Headers", {"A"}),
  #"Renamed Columns" = Table.RenameColumns(#"Removed Other Columns", {{"A", "A1"}}),
  #"Kept First Rows" = Table.FirstN(#"Renamed Columns", 300000)
in
  #"Kept First Rows"

Here’s a query that uses Table.Join and JoinAlgorithm.Dynamic to merge these two queries:


let
  Source =
    Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.Dynamic)
in
  Source

The average timings for this query on my PC were:

  • Progress Report End/25 Execute SQL – 2.0 seconds
  • Progress Report End/17 Read Data – 0.4 seconds

Changing this query to use JoinAlgorithm.LeftHash instead, like so:


let
  Source =
    Table.Join(First, {"A1"}, Second, {"A2"}, JoinKind.Inner, JoinAlgorithm.LeftHash)
in
  Source

…resulted in the following average timings:

  • Progress Report End/25 Execute SQL –  0.9 seconds
  • Progress Report End/17 Read Data – 0.6 seconds

An improvement of almost one second – but I’ve not included here all the other test results for algorithms that performed worse (I had to cancel the query that used JoinAlgorithm.LeftIndex because it was so slow). And just to be clear: I’m not saying that using JoinAlgorithm.LeftHash is always better than JoinAlgorithm.Dynamic, just that it happened to perform better in this case with these queries and this data. With different data and different queries then different algorithms may perform better. If you find other scenarios where specifying an algorithm improves performance then please let me know by leaving a comment.

Optimising The Performance Of Power Query Merges In Power BI, Part 3: Table.Join And SortMerge

In the last two posts in this series I showed how removing columns from the tables used in a Power Query merge operation can improve refresh performance. In this post I’ll show you a different trick that – when you can use it – can give you an equally significant performance boost.

When you merge data from two queries in the Power Query Editor the M code generated uses the Table.NestedJoin function. There is, however, another M function that can be used to merge data: Table.Join. The interesting thing about this function is that it has a parameter that Table.NestedJoin doesn’t have: the joinAlgorithm parameter allows you to specify the algorithm used by the Power Query engine for the merge. The documentation doesn’t tell you what the possible values for this parameter are, but #shared shows them:

  • JoinAlgorithm.Dynamic
  • JoinAlgorithm.LeftHash
  • JoinAlgorithm.LeftIndex
  • JoinAlgorithm.PairwiseHash
  • JoinAlgorithm.RightHash
  • JoinAlgorithm.RightIndex
  • JoinAlgorithm.SortMerge
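
If you want to list them yourself, here’s a minimal sketch that filters the #shared record for the JoinAlgorithm constants:

let
  // #shared is a record containing every value in the standard library
  SharedValues = Record.ToTable(#shared),
  // Keep only the JoinAlgorithm.* entries
  JoinAlgorithms =
    Table.SelectRows(SharedValues, each Text.StartsWith([Name], "JoinAlgorithm."))
in
  JoinAlgorithms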

The SortMerge algorithm, last in the list above, is the focus of this blog post. I mentioned in my earlier posts that the reason that merge operations on non-foldable data sources are often slow is that both of the tables used in the merge need to be held in memory. There is an exception though: if you know that the data in the columns used to join the two tables is sorted in ascending order, you can use the Table.Join function and the SortMerge algorithm and the data from both sources can be streamed rather than held in memory, which in turn results in the merge being much faster.

Here’s an example. As before there are two source queries that take data from a one million row CSV file, but for this test no columns are removed and there’s no filter on rows. The two source queries, called First and Second, are almost identical (in First the columns are named A1 to G1 and in Second the columns are named A2 to G2) and are not loaded into the dataset. Here’s the M code for the query called First:


let
  Source = Csv.Document(
    File.Contents("C:\Users\chwebb\Documents\NumbersMoreColumns.csv"),
    [Delimiter = ",", Columns = 7, Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers",
    {
      {"A", Int64.Type},
      {"B", Int64.Type},
      {"C", Int64.Type},
      {"D", Int64.Type},
      {"E", Int64.Type},
      {"F", Int64.Type},
      {"G", Int64.Type}
    }
  ),
  #"Renamed Columns" = Table.RenameColumns(
    #"Changed Type",
    {{"A", "A1"}, {"B", "B1"}, {"C", "C1"}, {"D", "D1"}, {"E", "E1"}, {"F", "F1"}, {"G", "G1"}}
  )
in
  #"Renamed Columns"


Here’s a query that merges these two queries using Table.NestedJoin, returns all the columns from both source queries, and is loaded into the dataset:


let
  Source = Table.NestedJoin(First, {"A1"}, Second, {"A2"}, "Second", JoinKind.Inner),
  #"Expanded Second" = Table.ExpandTableColumn(
    Source,
    "Second",
    {"A2", "B2", "C2", "D2", "E2", "F2", "G2"},
    {"Second.A2", "Second.B2", "Second.C2", "Second.D2", "Second.E2", "Second.F2", "Second.G2"}
  )
in
  #"Expanded Second"

The timings for refreshing the merge query are:

  • Progress Report End/25 Execute SQL – 54 seconds
  • Progress Report End/17 Read Data – 58 seconds

[As I mentioned before, these timings may vary by a few seconds each way from run to run because paging is taking place]

Now it just so happens that in this case I know the data in the A columns in both the source queries is sorted in ascending order, so this means I can rewrite the merge query using Table.Join and the SortMerge algorithm like so:


let
  Source =
    Table.Join(
      First, {"A1"}, Second, {"A2"},
      JoinKind.Inner, JoinAlgorithm.SortMerge
    )
in
  Source

Here are the timings for this version:

  • Progress Report End/25 Execute SQL – 0.1 seconds
  • Progress Report End/17 Read Data – 21 seconds

This new query is clearly much faster than the original version of the merge!

Now let’s talk about the limitations of this approach. First of all, if the data in the columns used to join the two tables together is not sorted in ascending order you won’t get an error message, you’ll just get incorrect data returned, so you really need to be sure that the data is indeed sorted. Secondly, if your data is not sorted, then you can sort it in Power Query before the merge – but since sorting itself takes time and sorting for non-foldable data sources is another one of those operations which requires the table to be held in memory, you’re unlikely to get any performance improvement.
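
If you did want to try sorting first anyway, here’s a sketch (reusing the First and Second queries from above) of what that would look like:

let
  // Sort both inputs on the join columns in ascending order
  SortedFirst = Table.Sort(First, {{"A1", Order.Ascending}}),
  SortedSecond = Table.Sort(Second, {{"A2", Order.Ascending}}),
  // SortMerge can now stream both inputs, but the sorts themselves
  // have to hold the tables in memory for non-foldable sources
  Merged =
    Table.Join(
      SortedFirst, {"A1"}, SortedSecond, {"A2"},
      JoinKind.Inner, JoinAlgorithm.SortMerge
    )
in
  Merged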

That said, I can see that there are going to be plenty of scenarios where you can use this technique. For example, if you’re extracting data to a CSV file it may be possible to specify that you want to sort the data when the extract takes place. Also, if you’re trying to merge data from two different data sources (say, SQL Server and Oracle) that both support query folding for sorting, then the time it takes to sort the data may be less than the performance gain from using the SortMerge algorithm.

 

 

Optimising The Performance Of Power Query Merges In Power BI, Part 2: Does When You Remove Columns Matter?

In my last post I demonstrated how the size of a table affects the performance of Power Query merge operations on non-foldable data sources in Power BI. Specifically, I showed that removing columns from the tables involved in a merge before the merge took place improved performance. But does it matter when you remove the columns? Is it enough to only select the columns you need when you expand the nested table returned by a merge, for example, or just to remove columns after the merge step? So, today’s question is:

Does it make a difference to Power Query merge performance if you remove unwanted columns from your source tables in the step before the merge or in the step afterwards?

The first problem I had to deal with when testing was that, as I said in my previous post, even when I reduced the number of columns in the million-row table I was using, all my queries were exceeding the 256MB container size limit and paging was taking place. This meant that any refresh timings I took in Profiler varied by up to 1-2 seconds from run to run. When you’re comparing queries that take around 50 seconds with queries that take around 10 seconds, a bit of variability in the timings from run to run doesn’t matter much. However, if you’re trying to compare two different test scenarios whose performance may not differ that much anyway, it becomes more of a problem.

To deal with this I reduced the number of rows in the tables I was using in my merge to 300,000 rows. This kept container memory usage under 256MB and reduced the variability in timings to a minimum. Here are the average timings across multiple runs for the last scenario tested in my previous post – where I removed all but one column in my source tables before the merge and then counted the rows in the table returned by the merge – with these 300,000 row source tables:

  • Progress Report End/25 Execute SQL – 2.4 seconds
  • Progress Report End/17 Read Data – 0 seconds

I then changed my source queries so they again returned all seven columns and changed the merge query so that it removed all but the two A columns in a Removed Columns step after the step containing the merge, like so:


let
  Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
  #"Expanded Second" = Table.ExpandTableColumn(
    Source,
    "Second",
    {"A", "B", "C", "D", "E", "F", "G"},
    {"Second.A", "Second.B", "Second.C", "Second.D", "Second.E", "Second.F", "Second.G"}
  ),
  #"Removed Other Columns" = Table.SelectColumns(#"Expanded Second", {"A", "Second.A"}),
  #"Counted Rows" = Table.RowCount(#"Removed Other Columns")
in
  #"Counted Rows"

The average timings for this version of the query were identical to those from the previous test:

  • Progress Report End/25 Execute SQL – 2.4 seconds
  • Progress Report End/17 Read Data – 0 seconds

Last of all, I removed all but column A from the source query called First, did not remove any columns from the source query called Second, and then only expanded column A in the #”Expanded Second” step in the merge query:


let
  Source = Table.NestedJoin(First, {"A"}, Second, {"A"}, "Second", JoinKind.Inner),
  #"Expanded Second" = Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
  #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
  #"Counted Rows"

Here are the average timings from this test:

  • Progress Report End/25 Execute SQL – 2.6 seconds
  • Progress Report End/17 Read Data – 0 seconds

These timings were slightly slower, but I don’t think the 200ms increase is really significant.

Therefore, from these tests, I think it’s safe to conclude that:

Removing unwanted columns in a step immediately after a merge results in the same performance benefits as removing unwanted columns in a step immediately before a merge

These results may surprise some of you. Why does removing columns after a merge perform so well? The answer is that just because you remove columns from a table in the step after a merge, it doesn’t mean that the Power Query engine actually removes the columns after it has performed the merge. On the contrary, it’s clever enough to realise that if you remove columns in the step after a merge then there’s no need for those columns to be present at the point where it performs the merge. You probably know how query folding works with certain data sources – well, it turns out that something similar to query folding happens within the Power Query engine whatever data source you’re using. The best explanation of this behaviour – which is fundamental to understanding how the Power Query engine works – can be found in the section on “Streaming Semantics” in this blog post by Ben Gribaudo in his excellent series on M:

https://bengribaudo.com/blog/2018/02/28/4391/power-query-m-primer-part5-paradigm

Of course there may be scenarios where this does not happen, or doesn’t happen efficiently, so I still recommend removing unwanted columns as early as possible in your queries just to be safe.

 

 

 

Optimising The Performance Of Power Query Merges In Power BI, Part 1: Removing Columns

Merging (or, in SQL terms, joining) tables in Power Query is a common cause of refresh performance problems. I’ve often wondered whether there’s anything you can do to optimise the performance of merges for non-foldable data sources, and so I decided to run some tests to try to answer all the questions I had. In this series of posts I’ll tell you what I found.

For these tests the only data source I used was a CSV file with one million rows and seven numeric columns named A, B, C, D, E, F and G:


I used SQL Server Profiler to measure the amount of time taken for a query to execute using the technique I blogged about here. If you read that post (and I strongly recommend you do) you’ll see there are actually two Profiler events whose duration is significant when measuring refresh performance:

  • Progress Report End/25 Execute SQL
  • Progress Report End/17 Read Data

It also turns out that these two events provide some insight into how certain transformations are handled in the Power Query engine, but we’ll come to that later.

The first question I decided to investigate was this:

Does the number of columns in a table affect the performance of a merge?

First of all, I created two identical queries called First and Second that connected to the CSV file, used the first row from the file as the headers, and set the data types of all seven columns to Whole Number. Nothing very special, but here’s the M code all the same:

let
  Source = Csv.Document(
    File.Contents("C:\NumbersMoreColumns.csv"), 
    [Delimiter = ",", Columns = 7, 
     Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = 
   Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers", 
    {
      {"A", Int64.Type}, 
      {"B", Int64.Type}, 
      {"C", Int64.Type}, 
      {"D", Int64.Type}, 
      {"E", Int64.Type}, 
      {"F", Int64.Type}, 
      {"G", Int64.Type}
    }
  )
in
  #"Changed Type"

I disabled these queries so that they were not loaded into the dataset.

Next, I created a third query that used the Table.NestedJoin function to merge the data from these two queries using an inner join and return all of the columns from both source queries:

let
  Source = Table.NestedJoin(
   First, {"A"}, Second, {"A"}, 
   "Second", JoinKind.Inner),
  #"Expanded Second" = Table.ExpandTableColumn(
    Source, 
    "Second", 
    {"A", "B", "C", "D", "E", "F", "G"}, 
    {"Second.A", "Second.B", "Second.C", 
     "Second.D", "Second.E", "Second.F", "Second.G"}
  )
in
  #"Expanded Second"


When I refreshed this query, in Profiler the two events I mentioned above had the following durations:

  • Progress Report End/25 Execute SQL – 40 seconds
  • Progress Report End/17 Read Data – 56 seconds

Pretty slow. But what is performance like when you merge two tables with one column instead of seven?

To test this, I added an extra step to the First and Second queries that removed all but the A columns (the ones needed for the merge) like so:

let
  Source = Csv.Document(
    File.Contents("C:\NumbersMoreColumns.csv"), 
    [Delimiter = ",", Columns = 7, 
     Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = 
   Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  #"Changed Type" = Table.TransformColumnTypes(
    #"Promoted Headers", 
    {
      {"A", Int64.Type}, 
      {"B", Int64.Type}, 
      {"C", Int64.Type}, 
      {"D", Int64.Type}, 
      {"E", Int64.Type}, 
      {"F", Int64.Type}, 
      {"G", Int64.Type}
    }
  ),
  #"Removed Other Columns" = 
   Table.SelectColumns(#"Changed Type", {"A"})
in
  #"Removed Other Columns"


I then updated the third query that contained the merge to reflect this change:

let
  Source = Table.NestedJoin(
   First, {"A"}, Second, {"A"}, 
   "Second", JoinKind.Inner),
  #"Expanded Second" = 
   Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"})
in
  #"Expanded Second"


When this query was refreshed, Profiler showed the following durations:

  • Progress Report End/25 Execute SQL – 9 seconds
  • Progress Report End/17 Read Data – 1 second

This query is a lot quicker, but then I thought: what if the performance is more to do with the size of the table returned by the query rather than the merge? So I added an extra step to the end of the merge query, like so:

let
  Source = Table.NestedJoin(
   First, {"A"}, Second, {"A"}, 
   "Second", JoinKind.Inner),
  #"Expanded Second" = 
   Table.ExpandTableColumn(Source, "Second", {"A"}, {"Second.A"}),
  #"Counted Rows" = Table.RowCount(#"Expanded Second")
in
  #"Counted Rows"


…and then reran the two tests above. My thinking was that, now the merge query only returns a single value, the amount of data returned by the query should not be a factor in the duration of the queries.

Here are the timings for the version with the merge on the tables with seven columns:

  • Progress Report End/25 Execute SQL – 56 seconds
  • Progress Report End/17 Read Data – 0 seconds

Here are the timings for the version with the merge on the tables with just one column:

  • Progress Report End/25 Execute SQL – 14 seconds
  • Progress Report End/17 Read Data – 0 seconds

This does seem to confirm that the number of columns in a table affects the performance of a merge, although of course it might be that it takes longer to count the rows of a table that has more columns.

This shows something else too: Read Data is instant in both cases, compared to the first two tests where it took longer than the Execute SQL events.

Why does the number of columns influence the performance of a merge? If you read my recent post on monitoring memory usage with Query Diagnostics, you’ll remember that merges have to take place in memory – so I guess the larger the tables involved in the merge, the more memory is needed and the more paging happens if the 256MB limit is exceeded. Looking at the performance counter data generated for the last two queries showed that the 256MB limit was indeed exceeded in both cases, but while the version joining the table with two columns had a maximum commit of 584MB, the maximum commit for the version joining the table with seven columns was almost 3GB.

That’s enough for now. At this point I think it’s fair to say the following:

Removing any unwanted columns before you merge two tables in Power Query will improve refresh performance.

Limit The Amount Of Data You Work With In Power BI Desktop Using Parameters And Deployment Pipelines

If you’re working with large amounts of data in Power BI you may find that you have problems because:

  • Your pbix file is very large
  • You spend a long time waiting for refreshes to finish in Power BI Desktop – and if you’re developing, you may need to refresh your dataset frequently
  • It takes a long time to publish your dataset to the Power BI Service

Wouldn’t it be great if there was a way to work with a small subset of your data in Power BI Desktop and then, after you publish, load all the data when you refresh? The good news is that this is now possible with the new deployment pipelines feature in Power BI!

Assuming that you know the basics of how deployment pipelines work (the documentation is very detailed), here’s a simple example of how to do this. Let’s say that you want to use data from the FactInternetSales table in the Adventure Works DW 2017 SQL Server sample database in your dataset. When you import the data from this table and open the Advanced Editor to look at the M code for the query, here’s what you’ll see:

let
  Source = Sql.Databases("MyServerName"),
  AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data],
  dbo_FactInternetSales = 
   AdventureWorksDW2017{[Schema = "dbo", Item = "FactInternetSales"]}[Data]
in
  dbo_FactInternetSales

This query, of course, imports all the data from this table. To cut it down to a smaller size, the first thing to do is to create a new Power Query parameter (called FilterRows here) of data type Decimal Number:
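
Although you create it through the Manage Parameters dialog, a parameter like this is just another M query that returns a value with some metadata attached. Here’s a rough sketch (the exact metadata record the UI generates may differ slightly) of what the FilterRows parameter looks like in the Advanced Editor:

// FilterRows: a Decimal Number parameter with a Current Value of 5
5 meta
[
  IsParameterQuery = true,
  Type = "Number",
  IsParameterQueryRequired = true
]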


Notice that the Current Value property is set to 5. The purpose of this parameter is to control the number of rows from FactInternetSales that are loaded into the dataset. Here’s an updated version of the Power Query query above that uses this parameter:

let
  Source = Sql.Databases("MyServerName"),
  AdventureWorksDW2017 = Source{[Name = "AdventureWorksDW2017"]}[Data],
  dbo_FactInternetSales = 
  AdventureWorksDW2017{[Schema = "dbo", Item = "FactInternetSales"]}[Data],
  FilterLogic = 
   if 
     FilterRows <= 0 
    then 
     dbo_FactInternetSales 
    else 
     Table.FirstN(
      dbo_FactInternetSales, 
      FilterRows
     )
in
  FilterLogic

A new step called FilterLogic has been added at the end of this query that implements the following logic:

  • If the FilterRows parameter is less than or equal to 0 then return all the rows in the FactInternetSales table, otherwise
  • If FilterRows is more than 0 then return that number of rows from the table

Given that the FilterRows parameter is set to 5, this means that the query now returns only the top 5 rows from FactInternetSales:


It’s important to point out that a filter like this will only make your refreshes faster if the Power Query engine is able to apply the filter without reading all the data in the table itself. In this case it can: with a SQL Server data source query folding ensures that the SQL query generated for the refresh only returns the top 5 rows from the FactInternetSales table:


Here’s a simple report with a card that shows the number of rows loaded into the table:


At this point you have your cut-down dataset for development in Power BI Desktop.

Next, publish this dataset and report to a workspace that is assigned to the Development slot in a deployment pipeline and then deploy them to the Test workspace:


Then click the button highlighted in the screenshot above to create a new dataset rule that changes the value of the FilterRows parameter to 0 when the dataset is deployed to the Test workspace:


With this rule in place, when the dataset in the Test workspace is refreshed, the logic in the query above now ensures that all the data from the FactInternetSales table is loaded into the dataset. Instead of just 5 rows, the report now shows that the full 60000 rows of data have been loaded:


 

Monitoring Power Query Memory Usage With Query Diagnostics In Power BI

In the April release of Power BI Desktop the Power Query Query Diagnostics feature was enhanced so that you can now return performance counter data. As the blog post says:

When you run performance counters, every half second Power Query will take a snapshot of resource utilization. This isn’t useful for very fast queries but can be helpful for queries that use up a lot more resources.

When might this be useful in the real world? In my last blog post I had a chart that shows the amount of data that Power Query reads from disk while loading a large-ish JSON file, created from data collected in Process Monitor using this technique. Here it is again:


The x axis is relative time in seconds from when Power Query started reading the data; the y axis shows the amount of data read. Notice how data is read at a constant rate for the first 1.5 seconds, but how after the 1.5 second mark the throughput flattens out? What could be causing this?

Although this does not happen consistently, and I didn’t collect the necessary data when I ran this particular test, the answer is likely to be related to how the Power Query engine uses memory.

This is a topic I have blogged about before and I strongly recommend that before carrying on you read this post on the Container Size property that can be set on dataflows in Power BI Premium. Here’s a quote from Curt Hagenlocher of the Power Query dev team from that post that is relevant here too, about how the Power Query engine uses memory when a query executes:

Certain operations force the rows of a table value to be enumerated. If the enumeration operation itself is expensive, then using Table.Buffer can be a performance optimization because we store the values in memory so that second and subsequent enumerations of the rows go against memory.

If the table is only being enumerated once (which is the most common scenario) or if the underlying enumeration is fast anyway, then Table.Buffer won’t help performance.

Table.Buffer can actually hurt performance in some cases, because we cap RAM usage of the query at 256 MB — which means that a query which uses more than 256 MB is now forced to page RAM to/from disk. Enough paging, and the performance cost can be quite dramatic.

Currently, “table at a time” operations like joins, sort, many groupings, pivot, unpivot, etc., all happen in RAM (unless folded). For large tables, these will consume a lot of memory.
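
To make this concrete, here’s a minimal sketch (the file path is hypothetical) of how Table.Buffer might be used when a table is enumerated more than once:

let
  Source = Csv.Document(
    File.Contents("C:\SomeLargeFile.csv"),
    [Delimiter = ",", Encoding = 65001, QuoteStyle = QuoteStyle.None]
  ),
  #"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  // Store the table in memory so that the two enumerations below read from RAM
  // rather than re-reading the file - but remember the 256MB container cap:
  // once the query needs more than that, paging to disk starts
  Buffered = Table.Buffer(#"Promoted Headers"),
  RowCount = Table.RowCount(Buffered),
  Sample = Table.FirstN(Buffered, 10)
in
  // Both fields enumerate the buffered table rather than the CSV file
  [NumberOfRows = RowCount, SampleRows = Sample]

Whether buffering actually helps depends, as the quote above says, on whether the table really is enumerated more than once and on whether it fits within the container limit.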

Now it turns out that when Power Query reads a JSON file it has to load the whole file into memory – something that isn’t true for other types of data source. So maybe, while reading the JSON file used in my test, the slow-down in throughput was caused by paging?

With the option to collect performance counters turned on (see the announcement blog post for how to do this) I opened up the pbix file containing the query from my last blog post, went to the Power Query Editor, and collected some query diagnostics data by right-clicking on the last step in my query and selecting Diagnose:


It soon became clear that the Power Query engine has a pool of mashup containers that it reuses (yes, go and read that blog post I told you to read!) and it seems like they often need to do some garbage collection before a query runs. So, to get a nice-looking graph, I did something highly unsupported that still seemed to work: I opened Task Manager and killed all the Microsoft Mashup Evaluation Container processes I could see under Power BI. Having done this, when I collected my performance counter data I could build the following graph showing Power Query memory usage while the query was evaluating:


The x axis shows the number of seconds elapsed since the start of the query; the y axis shows the value in bytes for the Commit and Working Set performance counters. The yellow line, for the Commit (bytes) performance counter, shows the amount of virtual memory used by Power Query. The blue line, for the Working Set (bytes) performance counter, shows the amount of physical memory used by Power Query; as you can see it reaches 256MB (indicated by the red dotted line) halfway through and never exceeds that. While Commit is greater than Working Set paging must be happening and Power Query performance may suffer as a result.

Over the years I have found that one of the most effective ways of improving Power Query performance is to try to avoid transformations like those that Curt mentions above that require large tables to be held in memory – although that’s easier said than done. At least now we have an easy way of seeing where memory might be causing problems for Power Query performance using this new feature.
