Data Type Conversions For SQL Server Sources And Query Folding In Power Query

It’s surprisingly easy to stop query folding happening in Power Query by changing the data type of a column. This is mentioned in the docs here, and it’s something several people have blogged about already (for example here). However there is something new to note: an option that will allow you to convert text columns to number or date columns in a foldable way for SQL Server data sources.

Consider the following table in a SQL Server database that consists of a single nvarchar(50) column containing numeric values:

Here’s an M query that converts this column into a numeric column and which folds:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest1 = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingTest = FoldingTest1
    {
      [
        Schema = "dbo",
        Item   = "NumberFoldingTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingTest,
    "ConvertedNumber",
    each Number.From([NumberAsText]),
    Int64.Type
  )
in
  #"Added Custom"

Here’s the output of the query, where a new custom column called ConvertedNumber contains the converted numeric values:

Here’s the resulting SQL generated by Power Query:

select [_].[NumberAsText] as [NumberAsText],
    convert(float, [_].[NumberAsText]) as [ConvertedNumber]
from [dbo].[NumberFoldingTest] as [_]

There are three important things to point out about the M query above:

  1. I have set the (relatively new) UnsafeTypeConversions property on the Sql.Databases function to true
  2. In the custom column I have used the Number.From function to convert the text in the NumberAsText column to numbers
  3. I have used the optional third parameter of Table.AddColumn to set the data type of the new custom column to the Int64 type

All three of these things are necessary to get a properly typed numeric column in your Power Query query – if you deviate from this pattern too much then folding won't happen.
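I'd expect the same pattern to work if you want a decimal number rather than a whole number. Here's a minimal sketch (untested, so treat it as an illustration rather than a guarantee) that simply swaps Int64.Type for type number in the last step – if you try it, check the native query for the step to confirm that folding is still taking place:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest1 = Source{[Name = "FoldingTest"]}[Data],
  dbo_NumberFoldingTest = FoldingTest1
    {[Schema = "dbo", Item = "NumberFoldingTest"]}
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingTest,
    "ConvertedNumber",
    each Number.From([NumberAsText]),
    // Decimal number instead of Int64.Type
    type number
  )
in
  #"Added Custom"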

It’s also possible to use this technique to convert text to datetime values. Here’s another SQL Server table, this time with dates stored in an nvarchar(50) column:

Here’s another M query that does the conversion and folds:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_DateFoldingTest = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "DateFoldingTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_DateFoldingTest,
    "ConvertedDate",
    each DateTime.From([DateAsText]),
    type datetime
  )
in
  #"Added Custom"

And here’s the resulting SQL:

select [_].[DateAsText] as [DateAsText],
    convert(datetime2, [_].[DateAsText]) as [ConvertedDate]
from [dbo].[DateFoldingTest] as [_]

Why, you ask, is this new property on Sql.Databases called “UnsafeTypeConversions”? As the name suggests, it allows you to do something that is potentially unsafe. Consider this SQL Server table that has an nvarchar(50) column containing some numeric values and one non-numeric value:

If you connect to this table and set the data type on this column to be Whole Number using the dropdown in the column header (the normal way to change the data type of a column), something like the M code below will be generated:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingErrorsTest
    = FoldingTest
    {
      [
        Schema = "dbo",
        Item = "NumberFoldingErrorsTest"
      ]
    }
    [Data],
  #"Changed Type"
    = Table.TransformColumnTypes(
    dbo_NumberFoldingErrorsTest,
    {{"MixedTextNumbers", Int64.Type}}
  )
in
  #"Changed Type"

Here’s the output of this query:

Note how this query returns four rows and the third row contains the error value shown.

If, however, you try to use the UnsafeTypeConversions approach here using something like the following M:

let
  Source = Sql.Databases(
    "localhost",
    [UnsafeTypeConversions = true]
  ),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NumberFoldingErrorsTest
    = FoldingTest
    {
      [
        Schema = "dbo",
        Item = "NumberFoldingErrorsTest"
      ]
    }
    [Data],
  #"Added Custom" = Table.AddColumn(
    dbo_NumberFoldingErrorsTest,
    "ConvertedToNumber",
    each Number.From([MixedTextNumbers]),
    Int64.Type
  )
in
  #"Added Custom"

You get the following result:

Notice now that there is an error value in both columns and, more importantly, only three rows are returned – the fourth has been lost. So, if you are going to use the UnsafeTypeConversions option you need to be 100% sure that the conversion will work and that you don't have any data quality problems.
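If you can query the source database directly, one way to check for problem values in advance is to look for rows that can't be converted. Here's a hypothetical T-SQL sketch for the table used in this example (substitute your own table and column names); TRY_CONVERT returns null when a conversion fails:

-- Find values that cannot be converted to float
-- Requires SQL Server 2012 or later
select [MixedTextNumbers]
from [dbo].[NumberFoldingErrorsTest]
where try_convert(float, [MixedTextNumbers]) is null
    and [MixedTextNumbers] is not null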

[Thanks to Curt Hagenlocher for the information in this post]

Increasing Refresh Parallelism – And Performance – In Power BI Premium

One of the factors that affects dataset refresh performance in Power BI is the number of objects that are refreshed in parallel. At the time of writing there is a default maximum of six objects that can be refreshed in parallel in Power BI Premium but this can be increased by using custom TMSL scripts to run your refresh.

A few months ago I blogged about how partitioning a table in Power BI Premium can speed up refresh performance. The dataset I created for that post contains a single table with nine partitions, each of which is connected to a CSV file stored in ADLSgen2 storage. Using the technique described by Phil Seamark here I was able to visualise the amount of parallelism when the dataset is refreshed in a Premium Per User workspace:

In this case I started the refresh from the Power BI portal so the default parallelism settings were used. The y axis on this graph shows there were six processing slots available, which means that six objects could be refreshed in parallel – and because there are nine partitions in the only table in the dataset, this in turn meant that some slots had to refresh two partitions. Overall the dataset took 33 seconds to refresh.

However, if you connect from SQL Server Management Studio to the dataset via the workspace’s XMLA Endpoint (it’s very similar to how you connect Profiler, something I blogged about here) you can construct a TMSL script to refresh these partitions with more parallelism. You can generate a TMSL script by right-clicking on your table in the Object Explorer pane and selecting Partitions:

…then, in the Partitions dialog, selecting all the partitions and clicking the Process button (in this case ‘process’ means the same thing as ‘refresh’):

…then, on the Process Partition(s) dialog, making sure all the partitions are selected and selecting Process Full from the Mode dropdown:

…and then clicking the Script button and selecting Script Action to New Query Window:

This generates a new TMSL script with a Refresh command that refreshes all the partitions:

This needs one more change to enable more parallelism though: it needs to be wrapped in a TMSL Sequence command that contains the maxParallelism property. Here's the snippet that goes before the Refresh command (you also need to close the extra braces after the Refresh command); a sketch of the complete structure is shown below:

{
  "sequence":
  {
    "maxParallelism": 9,

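Here's a minimal sketch of what the complete command looks like once the Refresh command generated by SSMS has been wrapped in the Sequence command. The database, table and partition names below are placeholders, and your script will list all nine partitions rather than the two shown:

{
  "sequence":
  {
    "maxParallelism": 9,
    "operations": [
      {
        "refresh": {
          "type": "full",
          "objects": [
            {
              "database": "MyDataset",
              "table": "MyTable",
              "partition": "Partition 1"
            },
            {
              "database": "MyDataset",
              "table": "MyTable",
              "partition": "Partition 2"
            }
          ]
        }
      }
    ]
  }
}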
Executing this command refreshed all nine partitions in parallel in nine slots:

This refresh took 25 seconds – eight seconds faster than the original refresh with six slots.

As you can see, increasing the number of refresh slots in this way can have a big impact on refresh performance – although, of course, you need to have enough tables or partitions to take advantage of any parallelism and you also need to be sure that your data source can handle the increased parallelism. You can try setting maxParallelism to any value up to 30, although no guarantees can be made about how many slots are available at any given time. It's also worth pointing out that there are scenarios where you may want to set maxParallelism to a value that is lower than the default of six, for example to reduce the load on data sources that can't handle many parallel queries.

[Thanks to Akshai Mirchandani for the information in this post]

Power BI/Power Query And Nullable Columns

Recently several colleagues, each troubleshooting a different type of performance problem, have asked me why Power BI generates SQL in a particular way, and in every case the answer has been the presence of nullable columns in the underlying database – whether it's SQL Server, Snowflake or Databricks. Now I'm not a DBA or any kind of database tuning expert so I can't comment on why a SQL query performs the way it does on any given platform, but what I can do is show you two examples of how the presence of nullable columns changes the way Power BI and Power Query generate SQL.

Consider the following table in a SQL Server database with a single integer column that does not allow null values:

If you connect to this table in DirectQuery mode, drag the MyNumber field into a card in a Power BI report and select the Distinct Count aggregation type:

…here’s the TSQL that is generated:

SELECT 
COUNT_BIG(DISTINCT [t0].[MyNumber])
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NotNullableColumn] as [$Table]
)
)
 AS [t0] 

Now if you do the same thing with a table that is identical in all respects but where the MyNumber column does allow null values:

…here’s the TSQL that Power BI generates:

SELECT 
(COUNT_BIG(DISTINCT [t1].[MyNumber]) 
+ MAX(CASE WHEN [t1].[MyNumber] IS NULL THEN 1 ELSE 0 END))
 AS [a0]
FROM 
(
(
select [$Table].[MyNumber] as [MyNumber]
from [dbo].[NullableColumn] as [$Table]
)
)
 AS [t1] 

Notice the extra code in the third line of this second query that has been added to handle the possible presence of null values: COUNT_BIG(DISTINCT) ignores nulls in TSQL, whereas DAX's DISTINCTCOUNT treats a blank as a value to be counted, so the CASE expression adds one to the count if any null values are present.

It’s not just when you’re using DirectQuery mode that you can see a difference. Let’s say you’re using Import mode and you take each of these tables and join them to themselves in the Power Query Editor like so:

Here’s the M code for this query:

let
  Source = Sql.Databases("localhost"),
  FoldingTest = Source
    {[Name = "FoldingTest"]}
    [Data],
  dbo_NotNullableColumn = FoldingTest
    {
      [
        Schema = "dbo",
        Item   = "NotNullableColumn"
      ]
    }
    [Data],
  #"Merged Queries" = Table.NestedJoin(
    dbo_NotNullableColumn,
    {"MyNumber"},
    dbo_NotNullableColumn,
    {"MyNumber"},
    "dbo_NotNullableColumn",
    JoinKind.Inner
  ),
  #"Expanded dbo_NotNullableColumn"
    = Table.ExpandTableColumn(
    #"Merged Queries",
    "dbo_NotNullableColumn",
    {"MyNumber"},
    {"dbo_NotNullableColumn.MyNumber"}
  )
in
  #"Expanded dbo_NotNullableColumn"

Joining the table with the non-nullable column to itself folds and results in the following TSQL query being generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NotNullableColumn.MyNumber]
from [dbo].[NotNullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NotNullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2])

If you do the same thing with the table with the nullable column, here’s the TSQL that is generated:

select [$Outer].[MyNumber] as [MyNumber],
    [$Inner].[MyNumber2] as [dbo_NullableColumn.MyNumber]
from [dbo].[NullableColumn] as [$Outer]
inner join 
(
    select [_].[MyNumber] as [MyNumber2]
    from [dbo].[NullableColumn] as [_]
) as [$Inner] on ([$Outer].[MyNumber] = [$Inner].[MyNumber2] 
or [$Outer].[MyNumber] is null and [$Inner].[MyNumber2] is null)

Once again you can see how the SQL generated for an operation on a nullable column is different to the SQL generated for an operation on a non-nullable column. Whether one SQL query performs significantly better or worse than the other is something you need to test.

The last thing to say is that there is no supported way in Power BI or Power Query to treat a nullable column as if it was not nullable. If you have a nullable column and the extra SQL to handle those nulls results in a performance problem then your only option is to alter the design of your table and make the column not nullable.
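For completeness, here's a hypothetical T-SQL sketch of that change for the example table used above, assuming the column is an int as it is here. Note that this will only succeed if the column contains no null values, and you should of course check the impact on anything else that uses the table first:

-- Make the MyNumber column non-nullable
-- This fails if any existing rows contain a null in MyNumber
alter table [dbo].[NullableColumn]
alter column [MyNumber] int not null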

Speed Up Power Query In Power BI Desktop By Increasing Or Decreasing The Number Of Evaluation Containers

Last week I showed how the new MaxEvaluationWorkingSetInMB registry setting could increase the performance of memory-hungry Power Query queries in Power BI Desktop. In this post I’ll show how the other new registry setting, ForegroundEvaluationContainerCount, can also help performance. Before I carry on I recommend you read the documentation on these new registry settings if you haven’t done so already.

To illustrate the effect of this setting I created ten identical Power Query queries feeding an Import mode dataset in a new .pbix file, each of which reads data from the same 150MB CSV file, applies a filter and then counts the number of rows returned. These queries don't require a large amount of memory but do take a couple of seconds to execute:
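Each of the ten queries looked something like the minimal sketch below – the file path, column name and filter condition here are placeholders rather than the exact ones I used:

let
  // Read the CSV file
  Source = Csv.Document(
    File.Contents("C:\demo.csv"),
    [Delimiter = ",", Encoding = 1252, QuoteStyle = QuoteStyle.None]
  ),
  // Apply a simple filter on one column
  #"Filtered Rows" = Table.SelectRows(Source, each [Column2] <> ""),
  // Count the remaining rows and return the count as a single-row table
  RowCount = Table.RowCount(#"Filtered Rows"),
  Output = #table(type table [Rows = Int64.Type], {{RowCount}})
in
  Output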

With ForegroundEvaluationContainerCount not set, refreshing the entire dataset (with background queries disabled) initially showed ten active evaluation containers:

I’m pretty sure these containers were used to determine the schemas of the tables returned (see here for more background); these were then joined by ten more containers which I assume were actually used by the refresh:

With these default settings refresh took 18 seconds according to Profiler.

With ForegroundEvaluationContainerCount set to 3:

This time there were never more than three evaluation containers active at any one time:

…and refresh took 24 seconds.

So we’ve proved that by setting ForegroundEvaluationContainerCount to a low value we can limit the amount of parallelism and, in this case, make performance worse. So why would you ever want to limit the amount of parallelism like this? The maximum amount of memory available to an evaluation container isn’t just controlled by the MaxEvaluationWorkingSetInMB registry setting; as the docs say, the effective maximum is also determined by the number of evaluation containers used. So reducing the amount of parallelism can increase the amount of memory available to each evaluation container and possibly increase performance.

I then created twenty new copies of the Power Query query, bringing the total number of queries in the pbix file to thirty, and removed the ForegroundEvaluationContainerCount registry key to go back to using the default settings. During refresh I saw that no more than twenty evaluation containers were active – as expected, because the docs state that with the default settings no more than twenty containers will be used. I'll spare you the screenshot. Refresh took 62 seconds.

Then I set ForegroundEvaluationContainerCount to 30 and refreshed. This time I could see thirty evaluation containers being used during refresh, and refresh took 55 seconds – not a massive improvement, but an improvement that I’m pretty sure can be attributed to the increased parallelism (I suspect that there was some other bottleneck here, possibly IO).

In conclusion, the ForegroundEvaluationContainerCount registry setting is another useful tool for improving refresh performance for Import mode datasets in Power BI Desktop (it's also useful for DirectQuery, but that's something for a future post). Finding the optimal value to set it to is not straightforward though and is likely to involve a lot of experimentation. As always, please let me know how you get on using it.

What The New Visio Web App And Licensing Announcement Means For Power BI

There was an interesting announcement today regarding Visio:

https://www.microsoft.com/en-us/microsoft-365/blog/2021/06/09/bringing-visio-to-microsoft-365-diagramming-for-everyone/

In summary there will soon be a lightweight, web-based version of Visio available to anyone with a Microsoft 365 Business, Office 365 E1/E3/E5, F3, A1, A3 or A5 subscription. Previously Visio was not part of the main M365 plans and was only available as a separate purchase.

So what? As a Power BI user, why should I care? Well the Visio custom visual for Power BI has been around a long time now and it’s really powerful. Unfortunately it’s very rarely used because Power BI developers don’t usually have Visio licences – but this is exactly what is about to change. With these licensing changes pretty much everyone who uses Power BI will have access to the new lightweight Visio web app. It’s not as sophisticated as desktop Visio but I’ll be honest, I’m no Visio expert and it’s good enough for me and really easy to use. As a result this is going to unlock the power of the Power BI Visio visual for a much, much larger number of people!

To get an idea of what you can do with the Power BI Visio visual, this video is a good place to start:

Speed Up Power Query In Power BI Desktop By Allocating More Memory To Evaluation Containers

A really useful new Power Query performance enhancement was added to Power BI Desktop in an update to the May release via the Microsoft Store a week or so ago (if you’re not installing Power BI Desktop through the Microsoft Store you’ll have to wait for the June release I’m afraid). You can read the documentation here:

https://docs.microsoft.com/en-us/power-bi/create-reports/desktop-evaluation-configuration

However if you have just read the docs you may be wondering what these two new registry key settings actually do. In this post I’m only going to talk about one, MaxEvaluationWorkingSetInMB; I’ll leave ForegroundEvaluationContainerCount for a future post.

At various times in the past I have blogged about how, when you run a Power Query query, the query itself is executed inside a separate process called an evaluation (or mashup) container and how this process has a limit on the amount of memory it can use. Some transformations such as sorting a table, doing a group by, pivoting and unpivoting require an entire table of data to be held in memory and if these operations require more memory than the evaluation container is able to use then it starts paging and query performance gets a lot worse. This post provides more details:

https://blog.crossjoin.co.uk/2020/05/21/monitoring-power-query-memory-usage-with-query-diagnostics-in-power-bi/

Two things have now changed though. First of all, the default amount of memory available to an evaluation container in Power BI Desktop has been increased from 256MB to 432MB. This on its own will make many Power Query queries run a lot faster. Secondly, it is now possible to define how much memory an evaluation container can use yourself via the new MaxEvaluationWorkingSetInMB registry setting described in the documentation.

Here’s an example that shows how much of an impact this can have. In Power BI Desktop I created a Power Query query that reads data from a csv file with around one million rows in it and then sorts the resulting table by the values in one column:

let
  Source = Csv.Document(
    File.Contents("C:\demo.csv"), 
    [
      Delimiter  = ",", 
      Columns    = 16, 
      Encoding   = 1252, 
      QuoteStyle = QuoteStyle.None
    ]
  ), 
  #"Sorted Rows" = Table.Sort(
    Source, 
    {{"Column2", Order.Ascending}}
  )
in
  #"Sorted Rows"

Using SQL Server Profiler in the way described here, I found that the Power Query query took almost 87 seconds to start returning data and a further 19 seconds to return all the data:

What’s more, in Task Manager I could see that the evaluation container doing the work was limited to using around 423MB of RAM:

I then used Regedit to set MaxEvaluationWorkingSetInMB to 4096, giving each evaluation container a maximum of 4GB of RAM to use:

After restarting Desktop I reran the same query. This time Task Manager showed the evaluation container doing the work using around 1.2GB of RAM:

…and Profiler showed that the query started returning data after only 14 seconds and returned all the data in a further 12 seconds:

As you can see, that’s a massive performance improvement. Before you get too excited about this, though, a few things need to be made clear.

First, this setting only affects the performance of Power Query queries in Power BI Desktop. It does not affect the performance of queries in the Power BI Service, although there is another setting that (I think) will have the same effect for queries that go through an on-premises data gateway – but that's yet another topic for a future post. So while this will make development much quicker and easier, it won't make dataset refreshes in the Power BI Service any quicker.

Second, you need to be very careful when changing this setting. There's no safety net here – you can set MaxEvaluationWorkingSetInMB to whatever value you want – so some care is needed. When a dataset is refreshed, multiple evaluation containers may be used to handle the Power Query transformations, each of which can use the amount of memory specified by MaxEvaluationWorkingSetInMB. For example, if twenty evaluation containers are in use and MaxEvaluationWorkingSetInMB is set to 4096, Power Query could in theory try to use up to 80GB of RAM. Since there's a finite amount of memory on your development PC, it's important you don't set MaxEvaluationWorkingSetInMB too high: if you do, there's a risk that Power BI will try to use more memory than you have available and bring your PC to a grinding halt. What's more, there's no way of knowing how much memory any given query will need without some experimentation, so my advice is that if you do change MaxEvaluationWorkingSetInMB you should only increase it by a small amount, and then increase it further only if you are sure you need it.

I’d love to hear how much changing this setting improves the performance of your queries. If it does prove to be useful to a large number of people I hope we can get it added to the Options dialog in Power BI Desktop (which is much more convenient than changing a registry key); I also think it would be very useful in Excel Power Query. Please leave a comment with your findings!
