Cross Database Query Folding For SQL Server Sources In Power Query In Power BI And Excel

A few months ago a new option was added to the Sql.Database and Sql.Databases functions in Power Query in Power BI and Excel which allows Power Query queries that combine data from different SQL Server databases to fold. Here’s a simple example showing how to use it.

On my local PC I have SQL Server installed and the Adventure Works DW 2017 and Contoso Retail DW sample databases:

Both of these databases have date dimension tables called DimDate. Let’s say you want to create a Power Query query that merges these two tables.

Here’s the M code for a Power Query query called DimDate AW to get just the DateKey and CalendarYear columns from the DimDate table of the Adventure Works DW 2017 database:

let
    Source = Sql.Database("localhost", "AdventureWorksDW2017"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"DateKey", "CalendarYear"})
in
    #"Removed Other Columns"

Here’s the M code for a Power Query query called DimDate Contoso to get just the Datekey and CalendarYear columns from the DimDate table in the ContosoRetailDW database:

let
    Source = Sql.Database("localhost", "ContosoRetailDW"),
    dbo_DimDate = Source{[Schema="dbo",Item="DimDate"]}[Data],
    #"Removed Other Columns" = Table.SelectColumns(dbo_DimDate,{"Datekey", "CalendarYear"})
in
    #"Removed Other Columns"

Both of these Power Query queries fold. However if you create a third query to merge these two queries (ie do the equivalent of a SQL join between them) on the CalendarYear columns like so:

let
  Source = Table.NestedJoin(
    #"DimDate Contoso",
    {"CalendarYear"},
    #"DimDate AW",
    {"CalendarYear"},
    "DimDate AW",
    JoinKind.LeftOuter
  ),
  #"Expanded DimDate AW" = Table.ExpandTableColumn(
    Source,
    "DimDate AW",
    {"DateKey", "CalendarYear"},
    {"DimDate AW.DateKey", "DimDate AW.CalendarYear"}
  )
in
  #"Expanded DimDate AW"

…this query does not fold, because it combines data from two different SQL Server databases.

However if you edit the Sql.Database function in the Source step of both of the first two queries above to set the new EnableCrossDatabaseFolding option to true, like so:

    Source = 
    Sql.Database(
        "localhost", 
        "ContosoRetailDW", 
        [EnableCrossDatabaseFolding=true]
        ),

…then the query that merges these two queries does fold:

Avoiding The “Maximum Allowed Size” Error In Power BI DirectQuery Mode With Aggregations On Degenerate Dimensions

Back in December I wrote a post explaining why you may encounter the “The resultset of a query to external data source has exceeded the maximum allowed size” error when working with degenerate dimensions in DirectQuery mode in Power BI. In that post I showed an example of how a moderately complex DAX expression in a measure can trigger the error even when you’re applying a filter in a visual; in this post I’ll show how you can use aggregations to avoid that error.

A quick recap: say you have a dataset built from the ContosoRetailDW sample database with a Date dimension table and a fact table called FactOnlineSales with more than 12 million rows in it.

There are two measures:

Sales Amount = SUM(FactOnlineSales[SalesAmount] )

Error Demo = 
var s = [Sales Amount]
return if(s>0, s)

If you build a table visual with the SalesOrderNumber column (a degenerate dimension from the fact table with the same granularity as the fact table) on rows and just the [Sales Amount] measure in, and you filter to a single date, then everything works ok:

…but if you add the [Error Demo] measure too a SQL query is generated to get all the values from the SalesOrderNumber column which returns more than a million rows and triggers the error:

For more details please see the original post.

Since the error is caused by a DirectQuery query to get all the values from the SalesOrderNumber column, one workaround is to build an Import mode aggregation table that contains all the distinct values from that column.

It was easy to do this in my case just using Power Query – I just duplicated the FactOnlineSales query, removed all columns apart from SalesOrderNumber, and did a “Remove Duplicates” transform. This left me with a new table containing just one column, SalesOrderNumber, that I could load into my dataset using Import mode:

I then configured this new table as an aggregation table for the FactOnlineSales table, with the SalesOrderNumber column from the new table set as a GroupBy on the SalesOrderNumber column from FactOnlineSales:

With this aggregation table configured, the table with both measures in no longer gives an error:

A Profiler trace shows that the query to get the data needed for the table is still in DirectQuery mode, but the “dimension query” to get all the distinct values from SalesOrderNumber now hits the Import mode aggregation:

I know what you’re going to say though: “If I’m using DirectQuery for my fact table I don’t want to build Import mode aggregations!”. Well yes, there are some limitations to point out with this approach. In my opinion it will work well if you are using DirectQuery mode because you have very large fact tables – even a few billion rows – but your data doesn’t change very frequently (say only once a day). In that scenario refreshing an aggregation table containing just a single column could be very fast and take up a relatively small amount of memory in Power BI, at least in comparison with an Import mode table containing all the columns from the fact table. Using incremental refresh on the aggregation table will also help but unfortunately you can’t use a hybrid table as an aggregation table at the time of writing this post, so you can’t mix Import mode and DirectQuery for the aggregation table. On the other hand if you’re using DirectQuery because your data changes frequently during the day then I don’t think this approach will work because it will be impossible to keep the contents of your Import mode aggregation table in sync with the contents of your DirectQuery fact table.

Power BI DirectQuery Mode And Degenerate Dimensions

A few weeks ago my colleague Dany Hoter wrote a post on the Azure Data Explorer blog about how using columns on fact tables as dimensions in DirectQuery mode can lead to errors in Power BI. You can read it here:

https://techcommunity.microsoft.com/t5/azure-data-explorer-blog/to-star-or-not-to-star-more-about-data-modeling-in-power-bi/ba-p/3689933

In the post he mentioned that he could reproduce the same behaviour in SQL Server, so I thought it would be good to show an example of this to raise awareness of the issue because I think it’s one that anyone using DirectQuery mode on any data source is likely to run into.

Consider a simple DirectQuery dataset built from the ContosoRetailDW SQL Server sample database:

There are two important things to point out here:

  1. The FactOnlineSales table has over 12 million rows
  2. The SalesOrderNumber column on the FactOnlineSales table is an example of a degenerate dimension, a dimension key in the fact table that does not have a related dimension table. This typically happens when it would make no sense to create a separate dimension table because it would be the same, or almost the same, granularity as the fact table.

Now if you try to build a simple table visual in a report connected to this dataset that has the SalesOrderNumber column and a measure that sums up the value of the SalesAmount column (ie with a definition of SUM(FactOnlineSales[SalesAmount] ) ) you’ll get the following error:

The resultset of a query to external data source has exceeded the maximum allowed size of ‘1000000’ rows.

This isn’t really surprising: you’ll get this error in DirectQuery mode any time Power BI generates a query against a data source that returns more than one million rows. You can increase this threshold using the Max Intermediate Row Set Count property on a Premium capacity but to be honest, if Power BI is trying to get this much data, you could still run into other performance or memory problems so the best thing to do is to redesign your report to avoid this problem. In this particular example you should question why you need a report with a table that has more than one million rows in it (I suspect the answer from the business will be “to export it to Excel”, which is another argument…).

In many cases you can avoid the error by applying a filter to the visual. In my example filtering the visual to display just the data for January 1st 2007, by applying the filter on the Datekey column of the DimDate table, means you don’t get an error because no SQL query will be generated that returns more than one million rows:

However, the bad news is that if you use slightly more complex DAX in your measure you are still likely run into the same error. For example, if you create a new measure with the following definition:

Error Demo = 
var s = [Sales Amount]
return if(s>0, s)

[Yes, I know the measure definition doesn’t make much sense but it’s just a way of reproducing the problem]

….you’ll get the error even with the filter applied:

Why is this? Looking at the various SQL queries generated by Power BI for this visual it’s easy to find the cause of the error:

SELECT 
TOP (1000001) [t1].[SalesOrderNumber]
FROM 
(
(
select [$Table].[OnlineSalesKey] as [OnlineSalesKey],
    [$Table].[DateKey] as [DateKey],
    [$Table].[StoreKey] as [StoreKey],
    [$Table].[ProductKey] as [ProductKey],
    [$Table].[PromotionKey] as [PromotionKey],
    [$Table].[CurrencyKey] as [CurrencyKey],
    [$Table].[CustomerKey] as [CustomerKey],
    [$Table].[SalesOrderNumber] as [SalesOrderNumber],
    [$Table].[SalesOrderLineNumber] as [SalesOrderLineNumber],
    [$Table].[SalesQuantity] as [SalesQuantity],
    [$Table].[SalesAmount] as [SalesAmount],
    [$Table].[ReturnQuantity] as [ReturnQuantity],
    [$Table].[ReturnAmount] as [ReturnAmount],
    [$Table].[DiscountQuantity] as [DiscountQuantity],
    [$Table].[DiscountAmount] as [DiscountAmount],
    [$Table].[TotalCost] as [TotalCost],
    [$Table].[UnitCost] as [UnitCost],
    [$Table].[UnitPrice] as [UnitPrice],
    [$Table].[ETLLoadID] as [ETLLoadID],
    [$Table].[LoadDate] as [LoadDate],
    [$Table].[UpdateDate] as [UpdateDate]
from [dbo].[FactOnlineSales] as [$Table]
)
)
 AS [t1]
GROUP BY [t1].[SalesOrderNumber] 

Why is this query generated? Why is there no WHERE clause applying the date filter? When a DAX query is run Power BI does the following:

  1. Send dimension queries per unique dimension table. Only filters directly placed on the dimension table apply in this case.
  2. Send measure queries for each measure. All filters may apply in this case.
  3. Join the results from 1 and 2.

Sometimes, as in the example where just the SUM measure is used, it can optimise the dimension queries and apply the filter from the dimension table too. Sometimes, as in the last example above, it can’t and it runs a dimension query asking for all the distinct values from SalesOrderNumber in the fact table. You can try to work around it by rewriting your DAX, by filtering on the fact table and not the dimension table – applying a filter on the Datekey column of FactOnlineSales for 1/1/2007, rather than on the Datekey column of DimDate, results in no error – or using a dynamic M parameter to get more control over the queries generated. However the best course of action is to avoid using any column from a large fact table as a dimension, whether it’s a legitimate degenerate dimension or a column that should really be modelled in a separate dimension table.

[Thanks to Dany Hoter and Jeffrey Wang for the information in this post]

UPDATE February 2023: in some cases you may be able to avoid this error by building an aggregation table https://blog.crossjoin.co.uk/2023/02/08/avoiding-the-maximum-allowed-size-error-in-power-bi-directquery-mode-with-aggregations-on-degenerate-dimensions/

Why Not All SQL Queries Can Be Used As A Source In Power BI DirectQuery Mode

A few days ago I was tagged on a thread on Twitter: my old pal Mim was upset to find that he couldn’t use a SQL query with a Common Table Expression as the source of a table in DirectQuery mode. He worked out why without my help but I thought that it was worth me writing up an explanation for other people who might be struggling with the same problem.

In Power BI DirectQuery mode you can – for most data sources – use your own SQL query as the source for a table in your dataset. For example when connecting to the AdventureWorksDW 2017 sample database in SQL Server using DirectQuery mode, I can use the following SQL query as the source of a table:

SELECT 
FullDateAlternateKey, 
EnglishDayNameOfWeek, 
EnglishMonthName
FROM DimDate

Now, let’s say I use this table in a dataset and build the following report with two slicers and a table, and with a month and a day name selected in the slicers:

The SQL generated by Power BI to populate the table looks like this:

SELECT 
TOP (501) [t0].[FullDateAlternateKey]
FROM 
(
(
SELECT 
FullDateAlternateKey, 
EnglishDayNameOfWeek, 
EnglishMonthName
FROM DimDate
)
)
 AS [t0]
WHERE 
(
(
[t0].[EnglishDayNameOfWeek] = N'Sunday'
)
 AND 
(
[t0].[EnglishMonthName] = N'April'
)
)

GROUP BY [t0].[FullDateAlternateKey],[t0].[EnglishDayNameOfWeek],[t0].[EnglishMonthName]
ORDER BY [t0].[EnglishDayNameOfWeek]
ASC
,[t0].[EnglishMonthName]
ASC
,[t0].[FullDateAlternateKey]
ASC

Notice how the original SQL query I used as the source of the table is used as a subquery in this query. This is how Power BI generates its SQL and this is why there are limitations on the SQL you can use in DirectQuery mode: not all SQL queries that run standalone can be used in subqueries.

Take the following variation on the SQL above for example:

SELECT
FullDateAlternateKey,
EnglishDayNameOfWeek,
EnglishMonthName
FROM DimDate
ORDER BY
EnglishMonthName

It’s the same query as before but with an ORDER BY clause. It works fine in the Power Query Editor but when you go to your report you’ll see the following error:

“Microsoft SQL: The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries and common table expressions unless TOP, OFFSET or FOR XML is also specified”

This is the error message generated by the SQL query that Power BI is trying to run – and as you will have guessed by now, it’s being generated because Power BI is trying to use your query with an ORDER BY clause inside its own subquery. You can’t put an ORDER BY clause in a subquery in TSQL except in the scenarios mentioned in the error message. The same goes for Common Table Expressions.

What can you do to work around this? There is really just one answer: model your data the way Power BI likes it (ie as a star schema) before it gets to Power BI. It’s the now-famous Roche’s Maxim. Even if you aren’t running into this limitation using SQL queries as the source of a table is a bad idea because it makes maintenance difficult and you end up doing any expensive transformations at query time and paying the performance penalty over and over again. And yes, I know, some of you don’t have permissions to create views or get the data modelled correctly in the source and I know it’s tough, but that’s the way it is.

TSQL Table-Valued Functions And Dynamic M Parameters In Power BI DirectQuery

My favourite – and it seems many other people’s favourite – new feature in the February 2022 Power BI Desktop release is support for more datasources (including SQL Server, Azure SQL DB and Synapse) with dynamic M parameters. In my opinion dynamic M parameters are extremely important for anyone planning to use DirectQuery: they give you a lot more control over the SQL that is generated by Power BI and therefore give you a lot more control over query performance.

Teo Lachev has already stolen my thunder and blogged about how the new functionality allows you to use a TSQL stored procedure as the source of a table in DirectQuery mode. In this post I’m going to show you something very similar – but instead of using a stored procedure, I’m going to show a simple example of how to use a TSQL table-valued function, which I think has a slight advantage in terms of ease-of-use.

To start off I created a table-valued function in the Adventure Works 2017 sample database on my local SQL Server which returns some filtered data from the DimDate table:

CREATE FUNCTION [dbo].[udfDates] (
    @day_name varchar(50),
	@month_name varchar(50)
)
RETURNS TABLE
AS
RETURN
    SELECT 
        FullDateAlternateKey, EnglishDayNameOfWeek, EnglishMonthName, CalendarYear
    FROM
        DimDate
    WHERE
        EnglishDayNameOfWeek=@day_name
		and
		EnglishMonthName=@month_name;

Here’s how it can be called in a SQL SELECT statement:

select 
	FullDateAlternateKey,
    EnglishDayNameOfWeek,
    EnglishMonthName,
    CalendarYear
from 
	udfDates('Thursday', 'February')
where
	CalendarYear=2005

Next, I created a Power Query query in my dataset that called this function and returned a table in DirectQuery mode. The great thing about table-valued functions is that they appear in the Navigator pane when you connect to a SQL Server database from the Power Query Editor:

You can only hard-code the values you pass to the function’s parameters at this point but even if you do nothing here you can just return an empty table. After having done this I selected DirectQuery mode:

After this I created two Power Query text parameters, called DayName and MonthName, to hold the month and day names to be passed to the function:

I also created two Import mode queries called DayNames and MonthNames to hold all the valid values for the DayName and MonthName parameters:

The last thing to do in the Power Query Editor was to edit the query calling the function to pass the M parameters to it. Here’s the M code for the query after the modification made to the “Invoked Functiondbo_udfDates1” step:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017"
  ), 
  dbo_udfDates = Source
    {
      [
        Schema = "dbo", 
        Item   = "udfDates"
      ]
    }
    [Data], 
  #"Invoked Functiondbo_udfDates1"
    = dbo_udfDates(DayName, MonthName)
in
  #"Invoked Functiondbo_udfDates1"

At this point I closed the Power Query Editor and loaded the three tables to my dataset:

Next I bound the EnglishDayNameOfWeek column on the DayNames table to the DayName M parameter and the EnglishMonthName column on the MonthNames table to the MonthName M parameter:

Finally, I built a report with two slicers bound to the two dynamic M parameter columns and a table showing the output of the table-valued function:

Here’s an example of the TSQL generated by Power BI to populate the table in this report:

SELECT 
TOP (501) 
[t0].[FullDateAlternateKey],
[t0].[EnglishDayNameOfWeek],
[t0].[EnglishMonthName]
FROM 
(
(
select [$Table].[FullDateAlternateKey],
    [$Table].[EnglishDayNameOfWeek],
    [$Table].[EnglishMonthName],
    [$Table].[CalendarYear]
from [dbo].[udfDates]('Thursday', 'February') as [$Table]
)
)
 AS [t0]
WHERE 
(
[t0].[CalendarYear] = 2010
)

GROUP BY 
[t0].[FullDateAlternateKey],[t0].[EnglishDayNameOfWeek],
[t0].[EnglishMonthName],[t0].[CalendarYear]
ORDER BY [t0].[FullDateAlternateKey]
ASC
,[t0].[EnglishMonthName]
ASC
,[t0].[CalendarYear]
ASC
,[t0].[EnglishDayNameOfWeek]
ASC
 

I am by no means an expert in writing efficient TSQL so I can’t comment on the pros and cons of table-valued functions, stored procedures or using native SQL queries in Power BI (although the last of these has obvious maintainability issues). Hopefully, though, you can see the possibilities – and if you do get round to using this approach on a project, please let me know how you get on!

Power BI DirectQuery Query Labels For SQL Server-Related Sources

If you’re using Power BI in DirectQuery mode against a SQL Server-related data source (ie SQL Server on-prem, Azure SQL DB or Synapse) you may have noticed a new feature that was released a few weeks ago: query labels that allow you to link a SQL query to the Power BI visual that generated it.

There’s nothing you need to do to enable it – it happens automatically. Here’s an example of a TSQL query generated by a DirectQuery mode dataset with it in:

You can see that an OPTION clause in the query adds a label that contains three GUIDs that identify the dataset, report and visual that the query was generated for. These are the same GUIDs that you’ll see used for this purpose in other places, for example Log Analytics. This post has more details on how you can work out which datasets, reports and and visuals these GUIDs relate to.

As I said, this only works for SQL Server-related sources at the moment, but if you think this would be useful for other sources (and I’m sure you do…) please leave a comment below!

Setting SQL Server CONTEXT_INFO In Power Query

In my quest to check out every last bit of obscure Power Query functionality, this week I looked into the ContextInfo option on the Sql.Database and Sql.Databases M functions. This option allows you to set CONTEXT_INFO in SQL Server (see here for a good article explaining what this is) and here’s an example of how to use it:

let
  Source = Sql.Database(
    "localhost", 
    "AdventureWorksDW2017", 
    [
      Query = "SELECT * FROM DIMDATE", 
      ContextInfo = Text.ToBinary(
        "Hello"
      )
    ]
  )
in
  Source

This Power Query query runs a simple SQL SELECT statement against the SQL Server Adventure Works DW 2017 database. Note that since you need to pass a binary value to the ContextInfo option, in this example I had to use the Text.ToBinary function to convert my text to binary.

Here’s what happens in SQL Server when this Power Query query is run:

Here’s a simple example of how to retrieve this data on the SQL Server side:

SELECT session_id, login_time, program_name, context_info 
FROM sys.dm_exec_sessions
WHERE session_id=57

I’ll leave it to the SQL Server experts to decide what this can be used for and no doubt to complain that it would be more useful to support SESSION_CONTEXT too – although I’ve heard that might already be used for something, so I need to do more research here…