Excel Cube Functions, Dynamic Arrays And Lambdas, Part 2: Returning Tables

In the first post in this series I showed how to use the new Excel Lambda helper functions to return an array containing all the items in a set. That isn’t very useful on its own, so in this post I’ll show you how to generate an entire dynamic table using Excel cube functions and Lambda helper functions.

In this post I’ll be using the same source data as in my previous post: a table containing sales data with just two columns.

With this table added to the Excel Data Model/Power Pivot, I created two measures:
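
The measure definitions themselves aren’t shown here; as a rough sketch they might look something like the following – the Sales column name and the forecast calculation are just assumptions for the purposes of illustration:

Sales Amount := SUM ( Sales[Sales] )

// Hypothetical definition – the real Forecast Sales calculation isn’t shown in this post
Forecast Sales := [Sales Amount] * 1.1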

I then created two sets using the CubeSet function – one containing the Products (in cell B2 of my worksheet) and one containing the Measures (in cell B4) – to use in my table:

=CUBESET("ThisWorkbookDataModel", "[Sales].[Product].[Product].MEMBERS", "Product Set")

=CUBESET("ThisWorkbookDataModel", "{[Measures].[Sales Amount], [Measures].[Forecast Sales]}", "Measure Set")

Here are the formulas shown in the worksheet:

And here’s the output – remember that the cell only displays the caption given in the third parameter of the CubeSet formula:

Now, here’s the fun part – a single formula that takes these sets and builds a table with the Measures on columns and the Products on rows:

=MAKEARRAY(
  CUBESETCOUNT(B2)+1,
  CUBESETCOUNT(B4)+1,
  LAMBDA(r,c,
   SWITCH(
    TRUE(),
    AND(r=1,c=1),
    "",
    c=1,
    CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$2,r-1),
    r=1,
    CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$4,c-1),
    CUBEVALUE("ThisWorkbookDataModel",
     CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$2,r1),
     CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$4,c-1)
    )
   )
  )
)

Here’s what this formula returns:

How does this work? Going through the MakeArray function step-by-step:

  • The first two parameters specify that the output will be an array with one more row than there are items in the Product set and one more column than there are items in the Measures set.
  • The third parameter is a Lambda that is called for every cell in this array. This Lambda contains a Switch with the following conditions:
    • For the top-left cell in the array, return a blank value
    • In the first column, use the CubeRankedMember function to return the Products on the rows of the table
    • In the first row, use the CubeRankedMember function to return the Measures on the columns of the table
    • In the body of the table, use the CubeValue function to return the values

Here’s a slightly more ambitious version that returns the same table but adds a total row to the bottom:

=
LET(
 NumberOfRows,
 CUBESETCOUNT(B2)+2,
 NumberOfColumns,
 CUBESETCOUNT(B4)+1,
 MAKEARRAY(
  NumberOfRows,
  NumberOfColumns,
  LAMBDA(r,c,
   SWITCH(
    TRUE(),
    AND(r=1,c=1),
    "",
    AND(r=NumberOfRows,c=1),
    "Total",
    r=NumberOfRows,
    CUBEVALUE("ThisWorkbookDataModel",
     $B$2,
     CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$4,c-1)),
    c=1,
    CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$2,r-1),
    r=1,
    CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$4,c-1),
    CUBEVALUE("ThisWorkbookDataModel",
     CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$2,r-1),
     CUBERANKEDMEMBER("ThisWorkbookDataModel",$B$4,c-1))
    )
   )
  )
)

Two extra things to note here:

  • This is a great example of a complex formula where the new Excel Let function can be used to improve readability and prevent the same value being evaluated twice.
  • The values in the Total row are calculated in the Excel Data Model, not on the worksheet, by using the CubeSet function inside the CubeValue function. This means that the totals will be consistent with what you see in a PivotTable and therefore correct.

This is still very much a proof-of-concept. I need to look at the performance of this approach (it may not be optimal and may need tuning), and I’m not sure how a table like this could be formatted dynamically (especially the Total row). It is exciting though!

Excel Cube Functions, Dynamic Arrays And Lambdas, Part 1: Getting All The Items From A Set

After my recent post on using Office Scripts and cube functions to generate Excel reports from Power BI data, Meagan Longoria asked me this question on Twitter:

To which I can only reply: guilty as charged. I have always loved the Excel cube functions even though they are maybe the least appreciated, least known and least used feature in the whole Microsoft BI stack. They have their issues (including sometimes performance) but they are great for building certain types of report in Excel that can’t be built any other way.

Anyway, the recent addition of new Lambda helper functions to Excel has made me particularly happy because they can be used with cube functions to overcome some limitations that have existed since cube functions were first released in Excel 2007, and to do some other cool things too. In this series of posts I’m going to explore some of the things they make possible.

Let’s start with something simple. In Excel, the CubeSet function can be used to return an (MDX) set of items. This set is stored in a single cell, though, and to extract each item into a cell on your worksheet you need to use the CubeRankedMember function. For example, let’s say I have a table called Sales on my worksheet:

…that is then loaded into the Excel Data Model (aka Power Pivot – although this works exactly the same if I use a Power BI dataset, Azure Analysis Services or SQL Server Analysis Services as my source):

What you can then do is use the CubeSet function to create a set of all the products like so:

=CUBESET("ThisWorkbookDataModel", "[Sales].[Product].[Product].MEMBERS", "Product Set")

…and then use the CubeRankedMember function to put each individual item of the set into a cell. Here’s a simple example worksheet, first with the formulas showing and then the results:

This example shows the fundamental problem that has always existed with CubeRankedMember though: in order to show all the items in a set you need to know how many items there are in advance, and populate as many cells with CubeRankedMember formulas as there are items. In this case see how the range B4:B6 contains the numbers 1, 2 and 3; these numbers are used in the formulas in the range C4:C6 to get the first, second and third items in the set.
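
For example, the formula in C4 would look something like this, reading the set from B2 and the rank number from B4:

=CUBERANKEDMEMBER("ThisWorkbookDataModel", $B$2, B4)

…and the formulas in C5 and C6 are the same except that they reference B5 and B6.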

If a fourth product was added to the table, however, it would not appear automatically – you would have to add another cell with another CubeRankedMember formula in it manually. I’ve seen some workarounds but they’re a bit hacky and require you to know what the maximum possible number of items in a set could ever be. Indeed that’s always been one of the key differences between cube functions and PivotTables: cube functions are static whereas PivotTables can grow and shrink dynamically when the data changes.

The new MakeArray function in Excel provides a really elegant solution to this problem: you can now write a single formula that returns a dynamic array containing all the items in the set. Assuming that the same CubeSet exists in B2 as shown above, you can do the following:

=MAKEARRAY(CUBESETCOUNT($B$2), 1, LAMBDA(r,c,CUBERANKEDMEMBER("ThisWorkbookDataModel",Sheet3!$B$2,r)))

Here’s the output:

Notice how the formula in cell B4 returns an array that spills all three items in the set into the range B4:B6.

How does this work?

  1. The CubeSetCount function is used to get the number of items in the CubeSet in B2.
  2. The MakeArray function is then used to create an array with the number of rows returned by CubeSetCount and one column
  3. In the third parameter of MakeArray the Lambda function is used to return a function that wraps CubeRankedMember, which is then called with the current row number of the array

The nice thing about this is that when more products are added to the Sales table they automatically appear in the output of the MakeArray formula in B4. So, for example, with two more products added to the Sales table like so:

Here’s the new output of the formula, showing the two new products returned in the array automatically:

This is not very useful on its own though. In my next post I’ll show you how this can be used to build a simple report.

Query Pending Event In Power BI Performance Analyzer

A quick note for anyone like me who spends too much time looking at the JSON exports from Performance Analyzer in Power BI Desktop: you may have noticed an event called Query Pending that isn’t (as yet) documented in the Word doc that explains the format of these JSON files.

It turns out that it’s not that interesting – it’s an event that has been added as part of an effort to make sure there are events to cover the whole of the query lifecycle. After the DAX queries for each visual in your report are generated they are added to a queue before they are executed. In some cases there could be several queries in the queue waiting to be executed, in which case they are said to be “pending”, and the Query Pending event tells you how long a query is in this pending state.

I haven’t seen a duration of longer than a couple of milliseconds for this event though, so you probably don’t need to worry much about it. If you ever do see a long Query Pending event please leave a comment – I’m curious to know what the cause might be.

[Thanks to John Vulner and Jon Ludwig for this information]

Generating Excel Reports Using Power BI, Cube Formulas, Office Scripts And Power Automate

Now that Excel reports connected to Power BI datasets work in Excel Online, a lot of new possibilities open up for doing cool things with Office Scripts and Power Automate. Here’s a simple example showing how all these technologies can be put together to automatically generate batches of Excel reports from a template.

Step 1: Create a template report in Excel using cube formulas

In Excel on the desktop I created a new Excel file, created a connection to a Power BI dataset and then built a simple report using Excel cube formulas:

Here are the Excel formulas for the table on the left:

This report uses data from the UK’s Land Registry (one of my favourite data sources) and shows the average price paid and number of sales broken down by property type for a single county (specified in cell B2 of this report – in the screenshot above data for Bournemouth is shown). Here’s the formula in B2:

=CUBEMEMBER("Price Paid", "[Property Transactions].[County].[All].[BOURNEMOUTH]")

This formula is referenced by all the CUBEVALUE formulas in the body of the table so they are all sliced by the selected county.
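
As an illustration – the exact cell layout and hierarchy names here are assumptions, not what’s in the actual workbook – a CUBEVALUE formula in the body of the table might look something like this, with a property type member in column A and a measure member in row 3:

=CUBEVALUE("Price Paid", $B$2, $A4, B$3)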

After doing this, I saved the file to OneDrive for Business.

Step 2: Create an Office Script to change the county shown in cell B2

The aim of this exercise is to generate one copy of the report above for each county in a list of counties, so the next thing I did was create a parameterised Office Script that takes the name of a county and changes the county name used in the formula in cell B2. To do this I opened the Excel report in Excel Online, started the script recorder, changed the formula in B2 and then stopped recording. I then edited this script to take a parameter for the county name (called county) to use in the formula. Here’s the script:

function main(workbook: ExcelScript.Workbook, county: string) {
	let selectedSheet = workbook.getActiveWorksheet();
	// Set range B2 on selectedSheet
	selectedSheet.getRange("B2").setFormulaLocal("=CUBEMEMBER(\"Price Paid\", \"[Property Transactions].[County].[All].[" + county + "]\")");
}

Step 3: Create a list of counties to pass to the script

Next, I created a second Excel workbook containing a table that contained the county names to pass to the script and saved this to OneDrive for Business too:

Step 4: Create Power Automate flow to call the script once for each county in the Excel table

Last of all, I created a Power Automate flow that reads the county names from the table in the previous step, runs the script for each county, creates a copy of the original Excel report after each script run and then saves it to a folder. Here’s the flow at a high level:

In more detail, here’s the setup for the ‘List rows present in a table’ action:

Here’s the ‘Run script’ action:

Here’s the expression used to get the current county name in the loop:

items('CountyLoop')?['Counties']

…and here’s the expression used to create the destination file path:

concat('/Office Scripts Reporting/Reports/',items('CountyLoop')?['Counties'],'.xlsx')

Running this flow results in three Excel workbooks being created, one for each county with the county name in the workbook name, stored in a folder like so:

Here’s the report in BATH AND NORTH EAST SOMERSET.xlsx:

Of course I could do other things at this point like email these workbooks to different people, but there’s no need to overcomplicate things – I hope you’ve got the idea.

A few last points to make:

  • Office Scripts don’t seem to work with PivotTables connected to Power BI datasets yet – I’m sure it’s just a matter of time before they do though
  • How is this different from using Power Automate to call the Power BI export API? A paginated report can be exported to Excel but this method gives you a lot more flexibility, because it allows you to use a lot more Excel functionality, not just the functionality that paginated reports can use in their exports. It also gives you a report that is connected live back to a dataset using cube functions, not static data.
  • Generating large numbers of Excel reports like this is not something I like to encourage – why not view your report in the Power BI portal, especially now you can view live Excel reports connected to datasets there too? – but I know it’s something that customers ask for.
  • I haven’t done any performance testing but I suspect that this method may be faster than using the Power BI export API in Power Automate.

Power BI Dataset Refresh, Column Encoding And The First Partition

If you’ve been following some of my recent posts about improving Power BI refresh performance by partitioning tables you will have seen a lot of screenshots that look like the one below:

It’s a visualisation from a report created by my colleague Phil Seamark (as detailed in this blog post) showing how long all the partitions in a dataset take to refresh. If you look at these visualisations you’ll probably ask the same question I did: why does the first partition always start before the others?

It turns out this is because when a table is refreshed, the first thing that has to happen is that a certain amount of data is read so that the type of encoding (Value or Hash) used for each column can be determined. In most cases tables only contain one partition so it’s not obvious that this is happening, but when a table has more than one partition this happens only for the first partition – which explains why the first partition seems to start before the others. You can’t avoid it happening but you can reduce the impact a little by using encoding hints (see here and here for more details): this process can be skipped for columns that have a Hash encoding hint, or which the engine knows in advance have to use Hash encoding, although it cannot be skipped for columns that have a Value encoding hint. What’s more, the Execute SQL event for the first partition has to complete before the Execute SQL events for all the other partitions can start.
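
For reference, an encoding hint is just a property on a column in the dataset’s TMSL definition, which you can set using a tool like Tabular Editor or via the XMLA Endpoint. A column definition with a hint might look something like this sketch (the column name and data type are illustrative):

{
  "name": "SalesKey",
  "dataType": "int64",
  "sourceColumn": "SalesKey",
  "encodingHint": "Hash"
}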

[Thanks to Akshai Mirchandani for the information in this post]

Dynamic M Parameters, Snowflake Native SQL And Paginated Reports

There were a couple of new features and enhancements to existing features in the June 2021 Power BI Desktop release that don’t seem to have much to do with each other but which I think can be combined to do cool things. They are:

  1. The new paginated report visual
  2. Native SQL support in the Snowflake connector
  3. Improvements to dynamic M parameters

Let me give you an example of what I mean…

First of all, let’s start with native SQL support in the Snowflake connector. I deal with a lot of customers who use Snowflake and Power BI together and I know just how much people have wanted this. What does it allow you to do? Well, you have always been able to use the Power Query Editor to transform data coming from Snowflake in either Import mode or DirectQuery mode. Now, though, you can write your own native SQL query and use it as the source for a Power Query query (something that has always been possible with some other connectors, such as the SQL Server connector). Incidentally, this also means that the EnableFolding=true option for Value.NativeQuery that I blogged about recently now works for Snowflake too.

The main reason you’d want to use a native SQL query when connecting to Snowflake, or indeed any database, is to do something that’s possible in SQL but not in Power Query. One example of this is using regular expressions to filter data. I have the AdventureWorks DW DimCustomer table loaded into Snowflake and I can use Snowflake’s REGEXP function to filter on the LASTNAME column with something like this:

SELECT 
DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION 
FROM "AWORKS"."PUBLIC"."DIMCUSTOMER" 
WHERE LASTNAME REGEXP 'To.*'

So that’s useful. I can use a query like this as the source of a table in DirectQuery mode in Power BI, but wouldn’t it be useful if end users of my report could change the regular expression used to filter the data? This is where dynamic M parameters come in. Assuming I have a table of pre-defined regular expressions:

And an M parameter:

…I can write an M query like this that uses the M parameter to return the regular expression used in the WHERE clause of the SQL query:

let
  Source = Value.NativeQuery(
    Snowflake.Databases(
      "mysnowflake.com", 
      "DEMO_WH"
    ){[Name = "AWORKS"]}[Data], 
    "SELECT DISTINCT FIRSTNAME, LASTNAME, ENGLISHOCCUPATION 
    FROM ""AWORKS"".""PUBLIC"".""DIMCUSTOMER"" 
    WHERE LASTNAME REGEXP '"
      & pRegEx
      & "'", 
    null, 
    [EnableFolding = true]
  )
in
  Source

…and then turn this into a dynamic M parameter in the Power BI diagram pane:

…and get a report that does this:

One limitation of dynamic M parameters in regular Power BI reports today is that the values you pass into them have to come from a column somewhere inside your dataset, so all of these values have to be pre-defined. Wouldn’t it be useful if the end user could enter any regular expression that they wanted though? That may not be possible in a regular Power BI report but it is possible with a paginated report, because with paginated reports you can write whatever DAX query you want – and therefore pass any value you want to a dynamic M parameter – and also, in a paginated report, you have the option of creating parameters where the user can enter whatever value they want.

I blogged about how to write DAX queries that contain dynamic M parameters here. Here’s an example of a parameterised DAX query (yes, I know, so many types of parameters…) that takes a regular expression and the name of an occupation and returns a table of customers whose last names match the regular expression and whose occupations match the one entered:

DEFINE
    MPARAMETER pRegEx = @DAXRegExParam
EVALUATE
FILTER (
    Customers,
    'Customers'[ENGLISHOCCUPATION] = @DAXOccupationParam
)

This can be used in a paginated report dataset connected to the Power BI dataset created above (yes, I know, so many types of datasets…) like so:

….which can then be used to build a paginated report that does this:

And of course, with the new paginated report visual, this paginated report can be embedded in a regular Power BI report:

All this is very much a proof-of-concept and not something I would recommend for production (I would be worried about SQL injection attacks for a start). There are more enhancements to these features still to come too. However, I do think it’s interesting to see how these features can be put together now and to imagine how they could be used in the future. What do you think?

Data Type Conversions For SQL Server Sources And Query Folding In Power Query

It’s surprisingly easy to stop query folding happening in Power Query by changing the data type of a column. This is mentioned in the docs here, and it’s something several people have blogged about already (for example here). However, there is something new to note: an option that allows you to convert text columns to number or date columns in a foldable way for SQL Server data sources.

Consider the following table in a SQL Server database that consists of a single nvarchar(50) column containing numeric values:

Here’s an M query that converts this column into a numeric column and which folds:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest1 = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingTest = FoldingTest1
    {
      [
        Schema = "dbo",
        Item   = "NumberFoldingTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingTest,
    "ConvertedNumber",
    each Number.From([NumberAsText]),
    Int64.Type
  )
in
  #"Added Custom"

Here’s the output of the query, where a new custom column called ConvertedNumber contains the converted numeric values:

Here’s the resulting SQL generated by Power Query:

select [_].[NumberAsText] as [NumberAsText],
    convert(float, [_].[NumberAsText]) as [ConvertedNumber]
from [dbo].[NumberFoldingTest] as [_]

There are three important things to point out about the M query above:

  1. I have set the (relatively new) UnsafeTypeConversions property on the Sql.Databases function to true
  2. In the custom column I have used the Number.From function to convert the text in the NumberAsText column to numbers
  3. I have used the optional third parameter of Table.AddColumn to set the data type of the new custom column to the Int64 type

All three of these things are necessary to get a properly typed numeric column in your Power Query query – if you vary from this too much then folding won’t happen.

It’s also possible to use this technique to convert text to datetime values. Here’s another SQL Server table, this time with dates stored in an nvarchar(50) column:

Here’s another M query that does the conversion and folds:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_DateFoldingTest = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "DateFoldingTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_DateFoldingTest,
    "ConvertedDate",
    each DateTime.From([DateAsText]),
    type datetime
  )
in
  #"Added Custom"

And here’s the resulting SQL:

select [_].[DateAsText] as [DateAsText],
    convert(datetime2, [_].[DateAsText]) as [ConvertedDate]
from [dbo].[DateFoldingTest] as [_]

Why, you ask, is this new property on Sql.Databases called “UnsafeTypeConversions”? As the name suggests, it allows you to do something that is potentially unsafe. Consider this SQL Server table that has an nvarchar(50) column containing some numeric values and one non-numeric value:

If you connect to this table and set the data type on this column to be Whole Number using the dropdown in the column header (the normal way to change the data type of a column), something like the M code below will be generated:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingErrorsTest
    = FoldingTest
    {
      [
        Schema = "dbo",
        Item = "NumberFoldingErrorsTest"
      ]
    }
    [Data],
  #"Changed Type"
    = Table.TransformColumnTypes(
    dbo_NumberFoldingErrorsTest,
    {{"MixedTextNumbers", Int64.Type}}
  )
in
  #"Changed Type"

Here’s the output of this query:

Note how this query returns four rows and the third row contains the error value shown.

If, however, you try to use the UnsafeTypeConversions approach here using something like the following M:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingErrorsTest
    = FoldingTest
    {
      [
        Schema = "dbo",
        Item = "NumberFoldingErrorsTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingErrorsTest,
    "ConvertedToNumber",
    each Number.From([MixedTextNumbers]),
    Int64.Type
  )
in
  #"Added Custom"

You get the following result:

Notice now that there is an error value in both columns and, more importantly, only three rows are returned – the fourth has been lost. So, if you are going to use the UnsafeTypeConversions option you need to be 100% sure that it will work and that you don’t have problems with your data quality.

[Thanks to Curt Hagenlocher for the information in this post]

Increasing Refresh Parallelism – And Performance – In Power BI Premium

One of the factors that affects dataset refresh performance in Power BI is the number of objects that are refreshed in parallel. At the time of writing there is a default maximum of six objects that can be refreshed in parallel in Power BI Premium but this can be increased by using custom TMSL scripts to run your refresh.

A few months ago I blogged about how partitioning a table in Power BI Premium can speed up refresh performance. The dataset I created for that post contains a single table with nine partitions, each of which is connected to a CSV file stored in ADLSgen2 storage. Using the technique described by Phil Seamark here I was able to visualise the amount of parallelism when the dataset is refreshed in a Premium Per User workspace:

In this case I started the refresh from the Power BI portal so the default parallelism settings were used. The y axis on this graph shows there were six processing slots available, which means that six objects could be refreshed in parallel – and because there are nine partitions in the only table in the dataset, this in turn meant that some slots had to refresh two partitions. Overall the dataset took 33 seconds to refresh.

However, if you connect from SQL Server Management Studio to the dataset via the workspace’s XMLA Endpoint (it’s very similar to how you connect Profiler, something I blogged about here) you can construct a TMSL script to refresh these partitions with more parallelism. You can generate a TMSL script by right-clicking on your table in the Object Explorer pane and selecting Partitions:

…then, in the Partitions dialog, selecting all the partitions and clicking the Process button (in this case ‘process’ means the same thing as ‘refresh’):

…then, in the Process Partition(s) dialog, making sure all the partitions are selected and selecting Process Full from the Mode dropdown:

…and then clicking the Script button and selecting Script Action to New Query Window:

This generates a new TMSL script with a Refresh command that refreshes all the partitions:

This needs one more change to enable more parallelism though: it needs to be wrapped in a TMSL Sequence command that contains the maxParallelism property. Here’s the snippet that goes before the refresh (you also need to close the braces after the Refresh command):

{
"sequence":
{
"maxParallelism": 9,

Executing this command refreshed all nine partitions in parallel in nine slots:

This refresh took 25 seconds – eight seconds faster than the original refresh with six slots.

As you can see, increasing the number of refresh slots in this way can have a big impact on refresh performance – although, of course, you need to have enough tables or partitions to take advantage of any parallelism and you also need to be sure that your data source can handle increased parallelism. You can try setting maxParallelism to any value up to 30, although no guarantees can be made about how many slots are available at any given time. It’s also worth pointing out that there are scenarios where you may want to set maxParallelism to a value that is lower than the default of six, for example to reduce the load on data sources that can’t handle many parallel queries.

[Thanks to Akshai Mirchandani for the information in this post]

Power BI/Power Query And Nullable Columns

Recently I’ve been asked by colleagues with various types of performance problems why Power BI is generating SQL in a particular way, and the answer has been the presence of nullable columns in the underlying database – whether it’s SQL Server, Snowflake or Databricks. Now I’m not a DBA or any kind of database tuning expert so I can’t comment on why a SQL query performs the way it does on any given platform, but what I can do is show you two examples of how the presence of nullable columns changes the way Power BI and Power Query generate SQL.

Consider the following table in a SQL Server database with a single integer column that does not allow null values:

If you connect to this table in DirectQuery mode, drag the MyNumber field into a card in a Power BI report and select the Distinct Count aggregation type:

…here’s the TSQL that is generated:

SELECT 
COUNT_BIG(DISTINCT [t0].[MyNumber])
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NotNullableColumn] as [$Table]
)
)
 AS [t0] 

Now if you do the same thing with a table that is identical in all respects but where the MyNumber column does allow null values:

…here’s the TSQL that Power BI generates:

SELECT 
(COUNT_BIG(DISTINCT [t1].[MyNumber]) 
+ MAX(CASE WHEN [t1].[MyNumber] IS NULL THEN 1 ELSE 0 END))
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NullableColumn] as [$Table]
)
)
 AS [t1] 

Notice the extra code in the third line of this second query that has been added to handle the possible presence of null values.

It’s not just when you’re using DirectQuery mode that you can see a difference. Let’s say you’re using Import mode and you take each of these tables and join them to themselves in the Power Query Editor like so:

Here’s the M code for this query:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NotNullableColumn = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "NotNullableColumn"
      ]
    }
    [Data],
  #"Merged Queries" = Table.NestedJoin(
    dbo_NotNullableColumn,
    {"MyNumber"},
    dbo_NotNullableColumn,
    {"MyNumber"},
    "dbo_NotNullableColumn",
    JoinKind.Inner
  ),
  #"Expanded dbo_NotNullableColumn"
    = Table.ExpandTableColumn(
    #"Merged Queries",
    "dbo_NotNullableColumn",
    {"MyNumber"},
    {"dbo_NotNullableColumn.MyNumber"}
  )
in
  #"Expanded dbo_NotNullableColumn"

Joining the table with the not nullable column to itself folds and results in the following TSQL query being generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NotNullableColumn.MyNumber]
from [dbo].[NotNullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NotNullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2])

If you do the same thing with the table with the nullable column, here’s the TSQL that is generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NullableColumn.MyNumber]
from [dbo].[NullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2] 
or [$Outer].[MyNumber] is null and [$Inner].[MyNumber2] is null)

Once again you can see how the SQL generated for an operation on a nullable column is different to the SQL generated for an operation on a non-nullable column. Whether one SQL query performs significantly better or worse than the other is something you need to test.

The last thing to say is that there is no supported way in Power BI or Power Query to treat a nullable column as if it were not nullable. If you have a nullable column and the extra SQL to handle those nulls results in a performance problem then your only option is to alter the design of your table and make the column not nullable.
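
For example, using the table from the DirectQuery example above, the change could be made with a simple ALTER TABLE statement – assuming any existing null values have been removed or replaced first, and assuming the column really is an int:

ALTER TABLE [dbo].[NullableColumn]
ALTER COLUMN [MyNumber] INT NOT NULL;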

Speed Up Power Query In Power BI Desktop By Increasing Or Decreasing The Number Of Evaluation Containers

Last week I showed how the new MaxEvaluationWorkingSetInMB registry setting could increase the performance of memory-hungry Power Query queries in Power BI Desktop. In this post I’ll show how the other new registry setting, ForegroundEvaluationContainerCount, can also help performance. Before I carry on I recommend you read the documentation on these new registry settings if you haven’t done so already.

To illustrate the effect of this setting I created ten identical Power Query queries feeding an Import mode dataset in a new .pbix file, each of which reads data from the same 150MB CSV file, applies a filter and then counts the number of rows returned. These queries don’t require a large amount of memory but do take a couple of seconds to execute:
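
The queries themselves aren’t shown here, but each one was something along the lines of the following sketch – the file path, column name and filter value are all placeholders:

let
  Source = Csv.Document(
    File.Contents("C:\Data\MyLargeFile.csv"),
    [Delimiter = ",", Encoding = 65001]
  ),
  PromotedHeaders = Table.PromoteHeaders(Source, [PromoteAllScalars = true]),
  // Placeholder filter – the real queries filtered on one of the columns in the CSV file
  FilteredRows = Table.SelectRows(PromotedHeaders, each [SomeColumn] = "SomeValue"),
  RowCount = Table.RowCount(FilteredRows)
in
  RowCount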

With ForegroundEvaluationContainerCount not set, refreshing the entire dataset (with background queries disabled) initially showed ten active evaluation containers:

I’m pretty sure these containers were used to determine the schemas of the tables returned (see here for more background); these were then joined by ten more containers which I assume were actually used by the refresh:

With these default settings refresh took 18 seconds according to Profiler.

With ForegroundEvaluationContainerCount set to 3:

This time there were never more than three evaluation containers active at any one time:

…and refresh took 24 seconds.

So we’ve proved that by setting ForegroundEvaluationContainerCount to a low value we can limit the amount of parallelism and, in this case, make performance worse. So why would you ever want to limit the amount of parallelism like this? The maximum amount of memory available to an evaluation container isn’t just controlled by the MaxEvaluationWorkingSetInMB registry setting; as the docs say, the effective maximum is also determined by the number of evaluation containers used. So reducing the amount of parallelism can increase the amount of memory available to each evaluation container and possibly increase performance.

I then created twenty new copies of the Power Query query, bringing the total number of queries in the pbix file to thirty, and removed the ForegroundEvaluationContainerCount registry key to go back to using the default settings. During refresh I saw that no more than twenty evaluation containers were active – as expected, because the docs state that with the default settings no more than twenty containers will be used. I’ll spare you the screenshot. Refresh took 62 seconds.

Then I set ForegroundEvaluationContainerCount to 30 and refreshed. This time I could see thirty evaluation containers being used during refresh, and refresh took 55 seconds – not a massive improvement, but an improvement that I’m pretty sure can be attributed to the increased parallelism (I suspect that there was some other bottleneck here, possibly IO).

In conclusion, the ForegroundEvaluationContainerCount registry setting is another useful tool for improving refresh performance for Import mode datasets (it’s also useful for DirectQuery, but that’s something for a future post) in Power BI Desktop. Finding the optimal value to set it to is not straightforward though and is likely to involve a lot of experimentation. As always, please let me know how you get on using it.
